LISA: Language-guided Interference-aware Spatial-Frequency Attention for Driver Gaze Estimation

📅 2026-05-17
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the degradation in driver gaze estimation accuracy caused by abrupt illumination changes, sensor noise, and irrelevant visual attributes. To this end, the authors propose LISA, a framework that integrates frequency-domain priors with vision-language knowledge through an interference-aware spatial-frequency attention mechanism. LISA fuses the spatial and frequency domains, injecting stable low-frequency semantics into high-frequency details while using spatial attention to focus on eye regions. During training, it also uses a frozen CLIP encoder combined with orthogonal regularization to disentangle gaze-relevant features from appearance-related distractions. Evaluated on two benchmark datasets, LISA achieves state-of-the-art performance and is significantly more robust to occlusions and illumination variations.
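To make the spatial-frequency idea concrete, the following is a minimal NumPy sketch of splitting an image into low- and high-frequency parts and of reading out its amplitude spectrum. It is an illustration only, not the authors' implementation: the function names, the circular low-pass mask, and the `radius` cutoff are all assumptions introduced here.

```python
import numpy as np


def split_frequency_bands(image, radius):
    """Split a 2-D grayscale image into low- and high-frequency components.

    `radius` is a hypothetical cutoff (in frequency bins) for a circular
    low-pass mask centred on the zero-frequency (DC) term.
    """
    # Move the DC term to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= radius

    # Invert each masked spectrum back to the spatial domain.
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high


def amplitude_and_phase(image):
    """Return the amplitude and phase spectra of a 2-D image."""
    spectrum = np.fft.fft2(image)
    return np.abs(spectrum), np.angle(spectrum)


rng = np.random.default_rng(0)
img = rng.random((16, 16))
low, high = split_frequency_bands(img, radius=3)
amp, _ = amplitude_and_phase(img)
# A circular shift is one simple spatial perturbation that changes the
# phase spectrum but leaves the amplitude spectrum unchanged.
amp_shifted, _ = amplitude_and_phase(np.roll(img, 3, axis=1))
```

Because the two masks are complementary, `low + high` reconstructs the input, and the low band carries the image mean. How LISA actually fuses the two bands and applies attention is not specified in this summary, so that step is deliberately left out of the sketch.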
📝 Abstract
Driver gaze estimation serves as a fundamental metric for evaluating driver attentiveness in modern monitoring systems. Beyond being vulnerable to sudden lighting changes and sensor noise, spatial-domain models struggle to disentangle authentic gaze cues from irrelevant visual attributes. In this paper, we propose LISA, a Language-guided Interference-aware Spatial-Frequency Attention framework that combines frequency-domain priors with vision-language knowledge. Observing that the amplitude spectrum remains relatively stable even under spatial perturbations, we design a dual-domain fusion mechanism. It integrates stable low-frequency semantics into high-frequency details, employing spatial attention to precisely target ocular regions. To reduce semantic ambiguity, we also introduce a training-time disentanglement strategy. Using a frozen CLIP encoder and orthogonal regularization, we explicitly separate gaze features from appearance interference. Experiments on two benchmarks show that LISA achieves state-of-the-art performance, with significantly improved robustness against occlusions and lighting variations. The code repository is available at https://github.com/Mason-bupt/LISA.
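The orthogonal regularization mentioned in the abstract can be pictured with a small sketch. The abstract does not give the exact loss, so the form below (mean squared cosine similarity between paired feature vectors) is an assumption, as are the function name and the `eps` parameter. In the paper the appearance features would come from the frozen CLIP encoder; here plain arrays stand in for them.

```python
import numpy as np


def orthogonality_penalty(gaze_feats, appearance_feats, eps=1e-8):
    """Mean squared cosine similarity between paired feature rows.

    Both inputs have shape (batch, dim). The penalty is 0 when every
    gaze vector is orthogonal to its paired appearance vector and
    approaches 1 when the pairs are parallel.
    """
    # L2-normalise each row; `eps` guards against division by zero.
    g = gaze_feats / (np.linalg.norm(gaze_feats, axis=1, keepdims=True) + eps)
    a = appearance_feats / (
        np.linalg.norm(appearance_feats, axis=1, keepdims=True) + eps
    )
    cos = np.sum(g * a, axis=1)
    return float(np.mean(cos ** 2))


# Hypothetical stand-ins for gaze features and CLIP appearance features.
gaze = np.array([[1.0, 0.0], [0.0, 2.0]])
appearance = np.array([[0.0, 3.0], [5.0, 0.0]])

penalty_orthogonal = orthogonality_penalty(gaze, appearance)
penalty_parallel = orthogonality_penalty(gaze, gaze)
```

Adding such a term to the training objective would push the gaze branch away from directions that the appearance features already explain; whether LISA uses this exact formulation or a different orthogonality constraint should be checked against the linked repository.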
Problem

Research questions and friction points this paper is trying to address.

driver gaze estimation
spatial interference
lighting variations
semantic ambiguity
visual attributes
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial-frequency attention
vision-language guidance
frequency-domain priors
feature disentanglement
gaze estimation