Designing embodied virtual environments where humans can naturally communicate through verbal and non-verbal channels is a challenging problem. It requires a deep understanding of human communication behaviors and patterns, as well as sophisticated models that can project this knowledge onto digital avatars. Non-verbal gestures and cues are a dominant channel for conveying information in interpersonal interactions. In this dissertation, we explore sub-problems in the domain of understanding human communication patterns, generating digital avatar behaviors, and studying how human communication patterns vary in digitally embodied environments.
We first present a study comparing group interactions in Virtual Reality (VR) and Videoconferencing (VC) settings. Participants achieved similar task performance in both settings; however, their gaze and other nonverbal behavior patterns differed between VR and VC. The findings inform how sharing an embodied 3D environment impacts the way we communicate information. Significant behavioral differences were observed, including increased activity in videoconferencing related to maintaining the social connection: more person-directed gaze and increased verbal and nonverbal backchannel behavior. Videoconferencing also showed reduced conversational overlap, increased self-adaptor gestures, and reduced deictic gestures compared with embodied VR.
We then explore the design of a pedagogical virtual agent that takes students through a discovery-based learning environment, the Mathematical Imagery Trainer for Proportionality (MITp). The agent fosters insights about concepts of ratios and proportions by guiding students through a carefully crafted set of tutorials. This is a challenging task for agent technology because the concrete feedback available from the learner is very limited, here restricted to the locations of two markers on the screen. A Dynamic Decision Network is used to automatically determine agent behavior, based on a deep understanding of the tutorial protocol. A pilot evaluation showed that all participants developed movement schemes supporting proto-proportional reasoning, and they were able to provide verbal proto-proportional expressions for one of the taught strategies.
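To make the decision-theoretic idea concrete, the following is a minimal sketch of a single decision step in a Dynamic Decision Network: the agent maintains a belief over hidden learner states, updates it from the only observable evidence (whether the two markers are held in the target ratio), and selects the feedback action with the highest expected utility. The learner states, actions, observation likelihoods, and utilities below are invented placeholders for illustration, not the actual MITp tutorial protocol or its parameters.

```python
# Sketch of one belief-update + action-selection step in a Dynamic Decision
# Network. All states, actions, and numbers are assumed, illustrative values.

STATES = ["exploring", "partial_strategy", "proportional_strategy"]
ACTIONS = ["wait", "hint_movement", "prompt_verbalize"]

# P(markers held in ratio | learner state) -- assumed observation model.
P_RATIO_OBS = {"exploring": 0.1, "partial_strategy": 0.5, "proportional_strategy": 0.9}

# Utility of taking an action when the learner is in a given state -- assumed.
UTILITY = {
    ("wait", "exploring"): 0.2, ("wait", "partial_strategy"): 0.5,
    ("wait", "proportional_strategy"): 0.9,
    ("hint_movement", "exploring"): 0.8, ("hint_movement", "partial_strategy"): 0.6,
    ("hint_movement", "proportional_strategy"): 0.1,
    ("prompt_verbalize", "exploring"): 0.1, ("prompt_verbalize", "partial_strategy"): 0.7,
    ("prompt_verbalize", "proportional_strategy"): 0.8,
}

def update_belief(belief, markers_in_ratio):
    """Bayesian filtering step: reweight each state by the observation likelihood."""
    likelihood = lambda s: P_RATIO_OBS[s] if markers_in_ratio else 1.0 - P_RATIO_OBS[s]
    unnorm = {s: belief[s] * likelihood(s) for s in STATES}
    z = sum(unnorm.values())
    return {s: p / z for s, p in unnorm.items()}

def choose_action(belief):
    """Select the action that maximizes expected utility under the current belief."""
    eu = {a: sum(belief[s] * UTILITY[(a, s)] for s in STATES) for a in ACTIONS}
    return max(eu, key=eu.get), eu

if __name__ == "__main__":
    belief = {s: 1.0 / len(STATES) for s in STATES}        # uniform prior
    belief = update_belief(belief, markers_in_ratio=True)  # observed evidence
    action, eu = choose_action(belief)
    print(belief, action, eu)
```

In the full system, this single step would be unrolled over time, with the belief carried forward from one tutorial moment to the next; the sketch only shows the core evidence-to-action loop.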
As a sub-problem in understanding gesture-based human communication, we also explore the Sign Language Recognition (SLR) problem. Isolated Sign Language Recognition (ISLR), in which videos of individual, word-level signs are correctly identified, is an important constituent task in developing SLR systems. We develop a novel model for ISLR based on a G3D-Attend module that uses spatial, temporal, and channel self-attention to contextualize aggregated spatial and temporal dependencies. We augment the datasets with 2D and 3D skeleton data, which is used along with RGB data in an ensemble-based approach to achieve state-of-the-art recognition rates. We then extend the approach to create weak sign-spotting labels. Spotting involves identifying individual signs in multi-sign sentences and is a significantly more difficult task due to co-articulation effects, differences in signing speeds, and the influence of contextual information on sign production. Manually generating sign labels in the continuous domain is laborious, and this approach can provide a set of labels that can then be used as weak supervision for future machine learning applications.
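As a rough illustration of the kind of factorized self-attention described above, the sketch below applies attention separately along the joint (spatial), frame (temporal), and channel axes of a skeleton feature tensor and fuses the three contextualized views. The module name, tensor layout, and fusion by summation are assumptions made for this example; it is not the dissertation's actual G3D-Attend design.

```python
import torch
import torch.nn as nn

class FactorizedSelfAttention(nn.Module):
    """Illustrative sketch: self-attention over the spatial, temporal, and
    channel axes of a skeleton feature tensor shaped (batch, channels,
    frames, joints). Not the actual G3D-Attend module."""

    def __init__(self, channels: int, frames: int, joints: int, heads: int = 4):
        super().__init__()
        # channels must be divisible by heads for nn.MultiheadAttention.
        self.spatial_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.channel_attn = nn.MultiheadAttention(frames * joints, 1, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, v = x.shape
        # Spatial: joints attend to each other within each frame.
        xs = x.permute(0, 2, 3, 1).reshape(b * t, v, c)
        xs, _ = self.spatial_attn(xs, xs, xs)
        xs = xs.reshape(b, t, v, c).permute(0, 3, 1, 2)
        # Temporal: frames attend to each other for each joint.
        xt = x.permute(0, 3, 2, 1).reshape(b * v, t, c)
        xt, _ = self.temporal_attn(xt, xt, xt)
        xt = xt.reshape(b, v, t, c).permute(0, 3, 2, 1)
        # Channel: channels attend over the flattened spatio-temporal map.
        xc = x.reshape(b, c, t * v)
        xc, _ = self.channel_attn(xc, xc, xc)
        xc = xc.reshape(b, c, t, v)
        # Fuse the three views with a residual connection and normalize.
        out = x + xs + xt + xc
        return self.norm(out.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

if __name__ == "__main__":
    x = torch.randn(2, 64, 32, 27)  # batch, channels, frames, joints
    block = FactorizedSelfAttention(channels=64, frames=32, joints=27)
    print(block(x).shape)  # torch.Size([2, 64, 32, 27])
```

In the ensemble setting described above, a module like this would operate on skeleton-stream features, with its predictions later combined with those of the RGB stream.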