🤖 AI Summary
Existing 3D gaze estimation methods suffer from limited generalizability due to scarce annotated data and substantial inter-domain distribution shifts. To address this, we propose OmniGaze, a cross-domain robust gaze estimation framework tailored for unconstrained real-world scenarios. First, we mitigate domain shift by leveraging multi-source unlabeled data. Second, we design a reward-model-based pseudo-label evaluation mechanism that jointly incorporates visual embeddings, semantic prompts generated by a multimodal large language model, and 3D gaze direction vectors to quantitatively score pseudo-label quality and enable weighted semi-supervised learning. Third, we build a scalable data engine for continuous improvement. Our method achieves state-of-the-art performance in both in-domain and cross-domain settings across five benchmark datasets. Crucially, it also demonstrates strong zero-shot generalization on four entirely unseen datasets, validating its adaptability to previously unobserved domains.
📝 Abstract
Current 3D gaze estimation methods struggle to generalize across diverse data domains, primarily due to i) the scarcity of annotated datasets and ii) the insufficient diversity of labeled data. In this work, we present OmniGaze, a semi-supervised framework for 3D gaze estimation that leverages large-scale unlabeled data collected from diverse, unconstrained real-world environments to mitigate domain bias and enable gaze estimation in the wild. First, we build a diverse collection of unlabeled facial images that vary in facial appearance, background environment, illumination conditions, head pose, and eye occlusion. To leverage unlabeled data spanning a broader distribution, OmniGaze adopts a standard pseudo-labeling strategy and devises a reward model to assess the reliability of pseudo labels. Beyond the pseudo labels themselves, represented as 3D direction vectors, the reward model also incorporates visual embeddings extracted by an off-the-shelf visual encoder and gaze-related semantic cues generated by prompting a Multimodal Large Language Model, and from these it computes confidence scores. These scores are then used to select high-quality pseudo labels and to weight them in the loss computation. Extensive experiments demonstrate that OmniGaze achieves state-of-the-art performance on five datasets under both in-domain and cross-domain settings. Furthermore, we evaluate the efficacy of OmniGaze as a scalable data engine for gaze estimation, which exhibits robust zero-shot generalization on four unseen datasets.
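The score-based selection and weighting described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the function names, the threshold `tau`, and the score-weighted angular-error loss are assumptions for illustration. In OmniGaze the confidence scores themselves come from the reward model fusing visual embeddings, MLLM-generated semantic cues, and the 3D direction vectors; here they are simply given as inputs.

```python
import math

def angular_error_deg(pred, target):
    """Angle in degrees between two 3D gaze direction vectors."""
    def unit(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    p, t = unit(pred), unit(target)
    cos = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, t))))
    return math.degrees(math.acos(cos))

def weighted_pseudo_label_loss(preds, pseudo_labels, scores, tau=0.5):
    """Hypothetical reward-weighted semi-supervised loss:
    keep only pseudo labels whose confidence score exceeds tau,
    then average their angular errors weighted by the score."""
    kept = [(s, angular_error_deg(p, y))
            for p, y, s in zip(preds, pseudo_labels, scores) if s > tau]
    if not kept:
        return 0.0  # no pseudo label passed the confidence filter
    total_w = sum(s for s, _ in kept)
    return sum(s * e for s, e in kept) / total_w
```

For example, with scores `[0.9, 0.3]` and `tau=0.5`, only the first sample contributes to the loss; raising the second score above the threshold brings its angular error back in, weighted by its confidence.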