The human gaze is a rich non-verbal cue that can enhance human-computer interfaces by enabling users to interact with devices through eye movements. The ability to accurately measure and interpret gaze direction plays a critical role in various domains, including social interactions, assistive technologies, augmented reality, and psychological research on cognitive states.
Over the past decade, gaze estimation has emerged as a prominent area of interest within the research community. Conventional gaze estimation methods rely on specialized hardware, including high-resolution cameras, infrared light sources, and image processing units, to detect eye features such as the pupil center and iris boundary. While these devices offer high accuracy and precision, their practical use is limited by factors such as high cost, restricted head movement, and a limited range of allowable distances between the user and the device. As an alternative to dedicated gaze-tracking hardware, several techniques have been developed to infer gaze direction directly from eye images captured by standard cameras on personal devices such as laptops, tablets, and phones.
The recent emergence of deep learning techniques has substantially advanced appearance-based gaze estimation. These methods map eye images directly to gaze targets without explicit detection of eye features and can therefore operate in unconstrained environments. However, their effectiveness depends heavily on access to extensive training datasets covering a variety of eye appearances, gaze directions, head poses, lighting conditions, and other variables. In this thesis, we focus on improving the adaptability and effectiveness of webcam-based gaze estimation techniques through generative modeling and representation learning.
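To make the appearance-based formulation concrete, the sketch below shows a minimal convolutional regressor that maps a cropped eye image directly to a (pitch, yaw) gaze direction and is trained as ordinary supervised regression. The architecture, input size, and loss are illustrative assumptions, not any specific model developed in this thesis.

```python
import torch
import torch.nn as nn

class GazeRegressor(nn.Module):
    """Minimal appearance-based gaze estimator: eye image -> (pitch, yaw)."""
    def __init__(self):
        super().__init__()
        # Small convolutional backbone over a 36x60 grayscale eye crop.
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Regression head producing two gaze angles in radians.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 9 * 15, 128), nn.ReLU(),
            nn.Linear(128, 2),
        )

    def forward(self, eye_image):
        return self.head(self.features(eye_image))

# Training reduces to supervised regression on labeled eye images.
model = GazeRegressor()
eye_batch = torch.randn(8, 1, 36, 60)   # batch of eye crops
gaze_labels = torch.randn(8, 2)         # ground-truth (pitch, yaw)
loss = nn.functional.l1_loss(model(eye_batch), gaze_labels)
loss.backward()
```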
First, we propose a simple approach for calibrating a laptop camera against a commercial gaze tracker, streamlining the collection of labeled gaze data and making it readily accessible to all users. This dataset can then be used to improve the accuracy of appearance-based gaze estimation methods for new users and new domains.
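As a rough illustration of this style of data collection, the snippet below pairs webcam frames with on-screen gaze points reported by a tracker. The my_tracker_sdk import and Tracker.latest_gaze_point call are hypothetical placeholders, since the actual interface depends on the vendor's SDK.

```python
import time
import cv2

# Hypothetical wrapper around a commercial eye tracker's SDK; the actual
# API depends on the vendor and is not part of this sketch.
from my_tracker_sdk import Tracker  # placeholder import

def collect_labeled_frames(num_samples=500, interval_s=0.1):
    """Pair webcam frames with on-screen gaze points from the tracker."""
    camera = cv2.VideoCapture(0)   # default laptop webcam
    tracker = Tracker()            # hypothetical tracker handle
    samples = []
    for _ in range(num_samples):
        ok, frame = camera.read()
        if not ok:
            break
        # Screen coordinates (in pixels) reported by the gaze tracker,
        # used as the label for the simultaneously captured frame.
        gaze_xy = tracker.latest_gaze_point()
        samples.append((frame, gaze_xy))
        time.sleep(interval_s)
    camera.release()
    return samples
```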
Second, we introduce a generative redirection framework designed to manipulate gaze direction and head pose orientation in synthesized images. This framework is used to generate augmented, gaze-labeled datasets, thereby enhancing the performance of gaze estimation methods.
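The sketch below conveys the general shape of a redirection network of this kind: an encoder-decoder conditioned on target gaze and head-pose angles that synthesizes a redirected image, which then inherits those angles as labels. It is a schematic illustration under simplified assumptions (image size, channel counts, conditioning scheme), not the framework developed in this thesis.

```python
import torch
import torch.nn as nn

class GazeRedirector(nn.Module):
    """Schematic redirection network: image + target angles -> redirected image."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Target gaze (pitch, yaw) and head pose (pitch, yaw) -> conditioning vector.
        self.condition = nn.Linear(4, 64)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, image, target_angles):
        feats = self.encoder(image)
        # Broadcast the conditioning vector over spatial locations.
        cond = self.condition(target_angles)[:, :, None, None]
        return self.decoder(feats + cond)

# A redirected image is paired with its target angles to form a new labeled sample.
redirector = GazeRedirector()
images = torch.rand(4, 3, 64, 64)
targets = torch.rand(4, 4)   # desired gaze and head-pose angles
augmented = redirector(images, targets)
```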
Third, we explore self-supervised contrastive learning to obtain equivariant gaze representations from an unlabeled multiview dataset. These gaze-specific representations are then used for few-shot gaze estimation, improving the efficacy of user-specific models.
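One possible instantiation of this idea is sketched below: an InfoNCE-style loss over simultaneously captured view pairs, where embeddings from one camera are rotated into the other camera's frame using the known relative rotation, so the learned representation behaves equivariantly with respect to viewpoint. The 3-dimensional embedding and single rotation matrix are simplifying assumptions, not the formulation used in this thesis.

```python
import torch
import torch.nn.functional as F

def multiview_nt_xent(z1, z2, rel_rot, temperature=0.1):
    """Contrastive loss over simultaneous multiview pairs.

    z1, z2  : (N, 3) embeddings of two camera views of the same moments.
    rel_rot : (3, 3) known rotation from camera 2 to camera 1, used to make
              the representation equivariant to the viewpoint change.
    """
    # Align view-2 embeddings into view-1's coordinate frame (equivariance).
    z2_aligned = z2 @ rel_rot.T
    z1 = F.normalize(z1, dim=1)
    z2_aligned = F.normalize(z2_aligned, dim=1)
    # Similarity of every view-1 embedding with every aligned view-2 embedding;
    # the diagonal holds the positive (same-moment) pairs.
    logits = z1 @ z2_aligned.T / temperature
    targets = torch.arange(z1.size(0))
    return F.cross_entropy(logits, targets)

# Example: 16 simultaneous pairs with a known relative camera rotation.
z1 = torch.randn(16, 3)
z2 = torch.randn(16, 3)
rel_rot = torch.eye(3)
loss = multiview_nt_xent(z1, z2, rel_rot)
```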
Finally, we present a spatiotemporal model for video-based gaze estimation that incorporates attention modules to capture both local spatial features and global temporal dynamics. We further improve its performance through person-specific few-shot adaptation with Gaussian processes.
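To illustrate the few-shot adaptation step, the sketch below fits a Gaussian process on a handful of calibration samples to map a generic model's gaze predictions to a specific person's true gaze. The kernel choice and the use of scikit-learn's GaussianProcessRegressor are assumptions made for illustration, not the exact formulation developed in this thesis.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def personalize(generic_preds_calib, true_gaze_calib, generic_preds_test):
    """Person-specific few-shot correction of a generic gaze estimator.

    A Gaussian process is fitted on a few calibration samples to map the
    generic model's predictions to the person's true gaze; at test time the
    fitted GP refines new predictions for that person.
    """
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(generic_preds_calib, true_gaze_calib)   # few-shot: e.g. 5-10 samples
    return gp.predict(generic_preds_test)

# Example with synthetic numbers: 9 calibration samples, 2D gaze angles.
calib_preds = np.random.randn(9, 2)
calib_truth = calib_preds + 0.1      # simulated person-specific bias
test_preds = np.random.randn(20, 2)
corrected = personalize(calib_preds, calib_truth, test_preds)
```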