🤖 AI Summary
This work addresses the limited generalization of existing appearance-based gaze estimation methods in open-domain scenarios, such as those involving eyeglasses or varying illumination, a limitation that stems from insufficient training-data diversity and inconsistent labels across datasets. To overcome these challenges without requiring additional manual annotation, we propose a lightweight framework that enhances data diversity through synthetic augmentation with glasses, face masks, and complex lighting conditions. The gaze regression task is reformulated as a multi-task learning problem that integrates multi-view supervised contrastive learning, discretized label classification, and eye-region segmentation. Despite having less than 1% of the parameters of the state-of-the-art UniGaze-H model, our approach achieves comparable generalization performance. Furthermore, we introduce the first robustness evaluation benchmark for gaze estimation under challenging real-world conditions, and the model's compact size enables high-accuracy, real-time tracking on mobile devices.
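As a concrete illustration of the augmentation step, the sketch below applies lighting jitter and alpha-composites a wearable overlay onto a face crop in PyTorch. The transform ranges, the 50% overlay probability, and the pre-aligned RGBA glasses/mask assets are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of the augmentation ensemble. The exact glasses/mask
# synthesis and lighting model used in the paper are not specified here;
# all parameter ranges below are assumptions.
import random
import torch
import torchvision.transforms.functional as TF

def augment_face(img, overlays):
    """img: (3, H, W) float tensor in [0, 1]; overlays: list of RGBA
    (4, H, W) tensors holding pre-aligned glasses / face-mask textures."""
    # Complex lighting: random brightness, contrast, and gamma jitter.
    img = TF.adjust_brightness(img, random.uniform(0.4, 1.6))
    img = TF.adjust_contrast(img, random.uniform(0.6, 1.4))
    img = TF.adjust_gamma(img, random.uniform(0.7, 1.5))
    # Wearable synthesis: alpha-composite a randomly chosen overlay.
    if overlays and random.random() < 0.5:
        rgba = random.choice(overlays)
        rgb, alpha = rgba[:3], rgba[3:4]
        img = alpha * rgb + (1.0 - alpha) * img
    return img.clamp(0.0, 1.0)
```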
📝 Abstract
Appearance-based gaze estimation (AGE) has achieved remarkable performance in constrained settings, yet we reveal a significant generalization gap: existing AGE models often fail in practical, unconstrained scenarios, particularly those involving facial wearables and poor lighting conditions. We attribute this failure to two core factors: limited image diversity and inconsistent label fidelity across datasets, especially along the pitch axis. To address these issues, we propose a robust AGE framework that enhances generalization without requiring additional human-annotated data. First, we expand the image manifold via an ensemble of augmentation techniques, including synthesis of eyeglasses, masks, and varied lighting. Second, to mitigate the impact of anisotropic inter-dataset label deviation, we reformulate gaze regression as a multi-task learning problem, incorporating multi-view supervised contrastive (SupCon) learning, discretized label classification, and eye-region segmentation as auxiliary objectives. To rigorously validate our approach, we curate new benchmark datasets designed to evaluate gaze robustness under challenging conditions, a dimension largely overlooked by existing evaluation protocols. Our lightweight MobileNet-based model achieves generalization performance competitive with the state-of-the-art (SOTA) UniGaze-H while using less than 1% of its parameters, enabling high-fidelity, real-time gaze tracking on mobile devices.
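To make the multi-task formulation concrete, here is a minimal PyTorch sketch of how the regression loss and the three auxiliary objectives might be combined. The loss weights, the 2-degree bin width, the head names, and the simplified two-view contrastive term (full SupCon additionally treats same-label samples as positives) are assumptions, not the paper's exact formulation.

```python
# Sketch of the multi-task objective: gaze regression plus the three
# auxiliary losses named in the abstract. Weights, bin width, and head
# shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

N_BINS = 90  # assumed: 2-degree bins covering [-90, 90) degrees

def discretize(angles_deg):
    """Map continuous pitch/yaw angles (B, 2) to integer bin indices."""
    return ((angles_deg + 90.0) / 2.0).long().clamp(0, N_BINS - 1)

def supcon_loss(z1, z2, temperature=0.1):
    """Simplified two-view contrastive loss: positives are the two
    augmented views of the same sample."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature            # (B, B) similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def total_loss(out, batch, w=(1.0, 0.5, 0.5, 0.5)):
    """out: dict with 'gaze' (B, 2), 'emb1'/'emb2' (B, D) projections of
    two views, 'pitch_logits'/'yaw_logits' (B, N_BINS), 'seg' (B, 1, H, W)."""
    l_reg = F.l1_loss(out['gaze'], batch['gaze_deg'])
    l_con = supcon_loss(out['emb1'], out['emb2'])
    bins = discretize(batch['gaze_deg'])
    l_cls = (F.cross_entropy(out['pitch_logits'], bins[:, 0]) +
             F.cross_entropy(out['yaw_logits'], bins[:, 1]))
    l_seg = F.binary_cross_entropy_with_logits(out['seg'], batch['eye_mask'])
    return w[0] * l_reg + w[1] * l_con + w[2] * l_cls + w[3] * l_seg
```

The discretized classification and segmentation heads are auxiliary: at inference only the regression head is needed, so the deployed model stays small enough for real-time mobile use.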