🤖 AI Summary
To address a key limitation of purely vision-based gaze estimation, namely its inability to exploit explicit directional priors, this paper proposes the first text-guided cross-modal gaze estimation paradigm. Methodologically, the authors design a text-face co-modeling framework featuring a direction-aware text generator that automatically produces natural-language descriptions of gaze direction, and a CLIP-driven fine-grained multimodal fusion module that aligns heterogeneous features and integrates them through cross-modal attention. The core contribution lies in injecting linguistically grounded directional semantics as structured priors into visual gaze modeling, thereby overcoming representational bottlenecks inherent in unimodal approaches. Extensive experiments demonstrate state-of-the-art performance on three major benchmarks (MPIIGaze, EyeDiap, and Gaze360), significantly outperforming existing vision-only methods. To foster reproducibility and further research, the authors will publicly release their code and pre-trained models.
📝 Abstract
Visual gaze estimation, with its wide range of application scenarios, has garnered increasing attention within the research community. While existing approaches infer gaze solely from image signals, recent advances in vision-language collaboration have demonstrated that integrating linguistic information can significantly enhance performance across various visual tasks. Leveraging the remarkable transferability of large-scale Contrastive Language-Image Pre-training (CLIP) models, we address the open question of how to effectively apply linguistic cues to gaze estimation. In this work, we propose GazeCLIP, a novel gaze estimation framework that deeply explores text-face collaboration. Specifically, we introduce a carefully designed linguistic description generator that produces text signals enriched with coarse directional cues. We further present a CLIP-based backbone adept at characterizing text-face pairs for gaze estimation, complemented by a fine-grained multimodal fusion module that models the intricate interrelationships between the heterogeneous inputs. Extensive experiments on three challenging datasets demonstrate the superiority of GazeCLIP, which achieves state-of-the-art accuracy. Our findings underscore the potential of vision-language collaboration for advancing gaze estimation and open new avenues for multimodal learning in visual tasks. The implementation code and pre-trained models will be made publicly available.
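The fine-grained multimodal fusion described above can be sketched in miniature: face-image tokens act as queries that attend to text tokens carrying directional semantics, and the attended features are regressed to a 2D gaze direction. This is a minimal, hypothetical illustration only; the function names, dimensions, single-head scaled dot-product attention, and linear head are assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of CLIP-style text-face cross-modal fusion.
# Face patch tokens (queries) attend to text tokens (keys/values),
# injecting directional semantics; a toy head regresses (pitch, yaw).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_fusion(face_tokens, text_tokens, W_q, W_k, W_v):
    """Single-head cross-attention: face queries over text keys/values."""
    Q = face_tokens @ W_q                            # (n_face, d)
    K = text_tokens @ W_k                            # (n_text, d)
    V = text_tokens @ W_v                            # (n_text, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (n_face, n_text)
    return face_tokens + attn @ V                    # residual fusion

rng = np.random.default_rng(0)
d = 64                                   # shared embedding width (assumed)
face = rng.standard_normal((49, d))      # e.g. 7x7 patch tokens from an image encoder
text = rng.standard_normal((8, d))       # tokens of e.g. "a face gazing to the left"
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

fused = cross_modal_fusion(face, text, W_q, W_k, W_v)

# Pool the fused tokens and map to a 2D gaze angle with a toy linear head.
W_head = rng.standard_normal((d, 2)) * 0.1
gaze = fused.mean(axis=0) @ W_head
print(fused.shape, gaze.shape)           # (49, 64) (2,)
```

In the full framework, the two token streams would come from CLIP's image and text encoders; here random features stand in so the fusion mechanics are visible in isolation.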