Multimodal Communication for Embodied Human-Robot Interaction with Natural Gestures

Abstract

Communication takes place in various forms and is an essential part of human-human interaction. Researchers have conducted numerous studies to understand it both biologically and computationally. It likewise plays a significant role in Human-Robot Interaction (HRI), where the goal is to endow Artificial Intelligence (AI) systems with humanlike cognition and sociality. With the advancement of realistic simulators, multimodal HRI with embodied agents in simulation has also become an active research area. Human users should be able to manipulate or collaborate with embodied agents through multiple modalities, including both verbal and non-verbal channels. To date, most prior work on embodied AI has focused on solving embodied agent tasks using verbal cues together with visual perception, e.g., using natural language instructions to guide embodied visual navigation. Nonetheless, non-verbal means of communication such as gestures, which are deeply rooted in human communication, are rarely examined in embodied agent tasks. In this dissertation, I reflect on existing research topics in embodied AI and propose to tackle embodied visual navigation tasks with natural human gestures, addressing the lack of non-verbal communicative interfaces in embodied HRI. This dissertation makes the following contributions:

- To this end, I first develop a 3D photo-realistic simulation environment, Gesture-based THOR (GesTHOR). In this simulator, a human user can wear a Virtual Reality (VR) Head-Mounted Display (HMD) to be immersed as a humanoid agent in the same environment as the robot agent, and can communicate with the robot interactively through instructional gestures captured by sensory devices that track body and hand motions. I also provide data-collection tools so that users can generate their own gesture data in the simulation environment.
- I created the Gesture ObjectNav Dataset (GOND) and standardized benchmarks to evaluate how gestures contribute to the embodied object navigation task. The dataset contains natural gestures collected from human users together with object navigation tasks defined in GesTHOR.
- To demonstrate the effectiveness of gestures for embodied navigation, I build an end-to-end Reinforcement Learning (RL) model that integrates multimodal perceptions so the robot can learn optimal navigation policies (see the sketch after this list). Through case studies with GOND in GesTHOR, I show that the robot agent can perform the navigation task successfully and efficiently with gestures instead of natural language. I also show that the navigation agent can learn the underlying semantics of gestures that are not predefined, which benefits its navigation.
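To make the third contribution concrete, the snippet below is a minimal sketch, not the dissertation's actual architecture, of an end-to-end actor-critic policy that fuses an egocentric RGB observation with a gesture represented as a sequence of 3D joint positions. All module names, tensor sizes, the number of joints, and the discrete action space are illustrative assumptions rather than details taken from GesTHOR or GOND.

```python
# Hypothetical sketch of a gesture-conditioned navigation policy.
# Assumptions: 3x128x128 egocentric frames, 21 tracked joints (x, y, z each),
# and 6 discrete navigation actions; none of these come from the dissertation.
import torch
import torch.nn as nn


class GestureConditionedPolicy(nn.Module):
    def __init__(self, num_actions: int = 6, num_joints: int = 21, hidden: int = 256):
        super().__init__()
        # Visual encoder: small CNN over the egocentric RGB frame.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(hidden), nn.ReLU(),
        )
        # Gesture encoder: GRU over a sequence of flattened joint positions.
        self.gesture_encoder = nn.GRU(input_size=num_joints * 3,
                                      hidden_size=hidden, batch_first=True)
        # Fused representation feeds a discrete actor head and a value head.
        self.fuse = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())
        self.actor = nn.Linear(hidden, num_actions)   # e.g., MoveAhead, Rotate, Stop
        self.critic = nn.Linear(hidden, 1)

    def forward(self, rgb, gesture):
        # rgb: (B, 3, 128, 128); gesture: (B, T, num_joints * 3)
        v = self.visual_encoder(rgb)
        _, h = self.gesture_encoder(gesture)          # h: (1, B, hidden)
        z = self.fuse(torch.cat([v, h.squeeze(0)], dim=-1))
        return self.actor(z), self.critic(z)          # action logits, state value


# Toy usage with random observations: batch of 2, gesture sequences of length 30.
policy = GestureConditionedPolicy()
logits, value = policy(torch.randn(2, 3, 128, 128), torch.randn(2, 30, 63))
```

The actor and critic heads are the usual interface for on-policy RL training (e.g., PPO or A2C); the key design point illustrated here is simply that the gesture embedding is concatenated with the visual embedding before the policy heads, so the agent can condition its navigation actions on the human's gesture.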

By introducing GesTHOR and GOND and presenting the related experimental results, I aim to spur growing interest in embodied HRI with non-verbal communicative interfaces toward building cognitive AI systems.
