CLIP-driven Dual Feature Enhancing Network for Gaze Estimation

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address limited accuracy and poor cross-domain generalization of gaze estimation in complex scenarios, this paper proposes CLIP-DFENet, a CLIP-driven Dual Feature Enhancing Network built on a novel 'main-side' collaborative enhancing strategy. A Language-driven Differential Module (LDM), constructed on CLIP's text encoder, models the semantic differences of gaze and equips the Core Feature Extractor with gaze-related semantic awareness; a Vision-driven Fusion Module (VFM) strengthens the generalized and valuable components of the visual embeddings from CLIP's image encoder and fuses them into the core features; and a Double-head Gaze Regressor maps the enhanced features to gaze directions. Extensive within-domain and cross-domain evaluations on four challenging benchmarks demonstrate substantial improvements in both accuracy and robustness, achieving state-of-the-art performance.
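The summary describes the architecture only at a high level. As rough orientation, here is a minimal PyTorch sketch of how the 'main-side' pieces could fit together. The module names follow the abstract (Core Feature Extractor, VFM, Double-head Gaze Regressor), but every layer shape, the gating-style fusion, and the head averaging are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a CLIP-DFENet-style forward pass (assumed details).
import torch
import torch.nn as nn

class VisionDrivenFusion(nn.Module):
    """VFM sketch: re-weight core features with a CLIP visual embedding.
    The gating fusion is one plausible reading; the paper's exact
    operator is not specified in this summary."""
    def __init__(self, core_dim=512, clip_dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(clip_dim, core_dim), nn.Sigmoid())
        self.proj = nn.Linear(clip_dim, core_dim)

    def forward(self, core_feat, clip_feat):
        return core_feat * self.gate(clip_feat) + self.proj(clip_feat)

class DoubleHeadGazeRegressor(nn.Module):
    """Two heads predicting (yaw, pitch); averaging them is an assumption."""
    def __init__(self, dim=512):
        super().__init__()
        self.head_a = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 2))
        self.head_b = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, feat):
        return 0.5 * (self.head_a(feat) + self.head_b(feat))

class CLIPDFENetSketch(nn.Module):
    def __init__(self, clip_image_encoder, core_extractor, dim=512):
        super().__init__()
        self.clip_image_encoder = clip_image_encoder  # frozen CLIP image tower
        self.core = core_extractor                    # trainable core backbone
        self.vfm = VisionDrivenFusion(dim, dim)
        self.regressor = DoubleHeadGazeRegressor(dim)

    def forward(self, face):
        with torch.no_grad():                         # keep CLIP frozen
            clip_feat = self.clip_image_encoder(face).float()
        core_feat = self.core(face)                   # gaze-specific features
        return self.regressor(self.vfm(core_feat, clip_feat))
```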

📝 Abstract
Complex application scenarios have raised critical requirements for precise and generalizable gaze estimation methods. Recently, the pre-trained CLIP has achieved remarkable performance on various vision tasks, but its potential has not been fully exploited in gaze estimation. In this paper, we propose a novel CLIP-driven Dual Feature Enhancing Network (CLIP-DFENet), which boosts gaze estimation performance with the help of CLIP under a novel 'main-side' collaborative enhancing strategy. Accordingly, a Language-driven Differential Module (LDM) is designed on the basis of CLIP's text encoder to reveal the semantic differences of gaze. This module empowers our Core Feature Extractor with the capability of characterizing gaze-related semantic information. Moreover, a Vision-driven Fusion Module (VFM) is introduced to strengthen the generalized and valuable components of the visual embeddings obtained via CLIP's image encoder, and utilizes them to further improve the generalization of the features captured by the Core Feature Extractor. Finally, a robust Double-head Gaze Regressor is adopted to map the enhanced features to gaze directions. Extensive experimental results on four challenging datasets over within-domain and cross-domain tasks demonstrate the discriminability and generalizability of our CLIP-DFENet.
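The abstract states that the LDM builds on CLIP's text encoder to reveal semantic differences of gaze, but not how the language side is instantiated. Below is one hedged reading using OpenAI's clip package: embed a few coarse gaze-direction prompts and score image-side features against them, which could then supervise the Core Feature Extractor. The prompt template, direction bins, and logit scale are illustrative assumptions, not the paper's recipe.

```python
# Hypothetical language-driven supervision in the spirit of the LDM.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Coarse semantic bins over gaze directions (illustrative prompts).
prompts = [f"a photo of a face gazing {d}" for d in
           ("left", "right", "up", "down", "straight ahead")]
with torch.no_grad():
    text_emb = model.encode_text(clip.tokenize(prompts).to(device)).float()
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

def language_difference_logits(image_features):
    """Cosine similarity of image-side features (dim 512 assumed) to each
    gaze-direction prompt; a cross-entropy over these logits could pull
    semantically different gazes apart in feature space."""
    img = image_features / image_features.norm(dim=-1, keepdim=True)
    return 100.0 * img @ text_emb.T
```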
Problem

Research questions and friction points this paper is trying to address.

Enhancing gaze estimation precision
Utilizing CLIP for semantic gaze differences
Improving cross-domain gaze generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

CLIP-driven feature enhancement
Language-driven Differential Module
Vision-driven Fusion Module
👥 Authors
Lin Zhang (Beijing Jiaotong University)
Yi Tian (XJTLU)
Wanru Xu (Beijing Jiaotong University)
Yi Jin (Beijing Jiaotong University)
Yaping Huang (Beijing Jiaotong University)