Enhancing 3D Gaze Estimation in the Wild using Weak Supervision with Gaze Following Labels

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
In unconstrained settings, 3D gaze estimation degrades due to appearance variation, large head-pose diversity, severe occlusion, and the scarcity of in-the-wild 3D annotations. To address this, the authors propose ST-WSGE, a two-stage self-training weakly supervised framework that leverages abundant 2D gaze-following data to generate 3D pseudo-labels. They also introduce the Gaze Transformer (GaT), a modality-agnostic architecture that learns static and dynamic gaze representations jointly from images and videos. Key contributions: (1) a weakly supervised mechanism that derives 3D gaze pseudo-labels from 2D gaze-following annotations; (2) a modality-agnostic GaT architecture enabling joint image-video gaze modeling; and (3) state-of-the-art within-domain and cross-domain results on the unconstrained Gaze360 and GFIE benchmarks, with notable cross-modal gains in video gaze estimation and superior cross-domain performance on MPIIFaceGaze and Gaze360 compared to frontal-face methods.

📝 Abstract
Accurate 3D gaze estimation in unconstrained real-world environments remains a significant challenge due to variations in appearance, head pose, occlusion, and the limited availability of in-the-wild 3D gaze datasets. To address these challenges, we introduce a novel Self-Training Weakly-Supervised Gaze Estimation framework (ST-WSGE). This two-stage learning framework leverages diverse 2D gaze datasets, such as gaze-following data, which offer rich variation in appearance, natural scenes, and gaze distributions, and uses them to generate 3D pseudo-labels that enhance model generalization. Furthermore, traditional modality-specific models, designed separately for images or videos, limit the effective use of available training data. To overcome this, we propose the Gaze Transformer (GaT), a modality-agnostic architecture capable of simultaneously learning static and dynamic gaze information from both image and video datasets. By combining 3D video datasets with 2D gaze target labels from gaze-following tasks, our approach achieves the following key contributions: (i) significant state-of-the-art improvements in within-domain and cross-domain generalization on unconstrained benchmarks such as Gaze360 and GFIE, with notable cross-modal gains in video gaze estimation; (ii) superior cross-domain performance on datasets such as MPIIFaceGaze and Gaze360 compared to frontal-face methods. Code and pre-trained models will be released to the community.
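The pseudo-label idea can be illustrated with a generic geometric sketch (an illustration of one plausible construction, not the paper's actual pipeline — all function names here are hypothetical): given 2D head and gaze-target annotations, a depth map, and camera intrinsics, both points can be back-projected into 3D camera coordinates, and the normalized difference serves as a 3D gaze direction pseudo-label.

```python
import numpy as np

def backproject(u, v, z, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth z (meters) into camera coordinates."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

def gaze_pseudo_label(head_px, target_px, depth_map, intrinsics):
    """Hypothetical 3D gaze pseudo-label from a 2D gaze-following annotation.

    head_px / target_px: (u, v) pixel coordinates of the head and gaze target.
    depth_map:           per-pixel depth in meters (e.g., from an RGB-D sensor).
    intrinsics:          (fx, fy, cx, cy) pinhole camera parameters.
    """
    fx, fy, cx, cy = intrinsics
    head_3d = backproject(*head_px, depth_map[head_px[1], head_px[0]], fx, fy, cx, cy)
    target_3d = backproject(*target_px, depth_map[target_px[1], target_px[0]], fx, fy, cx, cy)
    direction = target_3d - head_3d
    return direction / np.linalg.norm(direction)  # unit 3D gaze vector
```

This kind of geometry is only applicable when depth is available (as in RGB-D datasets such as GFIE); how ST-WSGE actually lifts 2D labels to 3D is described in the paper itself.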
Problem

Research questions and friction points this paper is trying to address.

Degraded 3D gaze accuracy under appearance variation, head-pose diversity, and occlusion
Scarce in-the-wild 3D gaze annotations, versus abundant but 2D-only gaze-following labels
Limited cross-domain and cross-modal generalization of existing models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Training Weakly-Supervised Gaze Estimation framework (ST-WSGE)
Gaze Transformer (GaT), a modality-agnostic image/video architecture
Joint training on 3D video datasets and 2D gaze-following labels
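The two-stage self-training pattern behind the first contribution can be sketched generically (hypothetical stub functions, not the paper's training code): a teacher trained on labeled 3D data pseudo-labels the 2D-only gaze-following set, and a student then trains on the union of real and pseudo-labeled samples.

```python
def train(model, dataset):
    # Placeholder for an actual optimization loop; here we only
    # record how many samples the model was trained on.
    model["seen"] = model.get("seen", 0) + len(dataset)
    return model

def pseudo_label(teacher, unlabeled):
    # The teacher predicts a 3D gaze vector for each 2D-only sample;
    # here we attach a dummy prediction in place of a real inference call.
    return [(x, "pseudo_3d_gaze") for x in unlabeled]

labeled_3d = [("img_a", "gt_3d_gaze"), ("img_b", "gt_3d_gaze")]
unlabeled_2d = ["img_c", "img_d", "img_e"]

teacher = train({}, labeled_3d)               # Stage 1: supervised teacher
pseudo = pseudo_label(teacher, unlabeled_2d)  # generate 3D pseudo-labels
student = train({}, labeled_3d + pseudo)      # Stage 2: student on the union
```

In practice such pipelines usually also filter pseudo-labels by prediction confidence before the student stage; whether and how ST-WSGE does so is specified in the paper.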