LISA: Language-guided Interference-aware Spatial-Frequency Attention for Driver Gaze Estimation

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the degradation in driver gaze estimation accuracy caused by abrupt illumination changes, sensor noise, and irrelevant visual attributes. To this end, the authors propose LISA, a framework that integrates frequency-domain priors with vision-language knowledge through an interference-aware spatial-frequency attention mechanism. LISA fuses the spatial and frequency domains, injecting stable low-frequency semantics into high-frequency details while using spatial attention to focus on eye regions. During training, it also uses a frozen CLIP encoder together with orthogonal regularization to disentangle gaze-relevant features from appearance-related distractions. Evaluated on two benchmark datasets, LISA is reported to achieve state-of-the-art performance, with improved robustness against occlusions and illumination variations.
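
As a rough illustration of what orthogonal regularization between two feature sets can look like, the sketch below penalizes the squared cosine similarity between paired gaze and appearance feature vectors. It is a minimal NumPy sketch under stated assumptions, not the authors' loss: the function name `orthogonality_penalty`, the argument names, and the choice of squared cosine similarity are all hypothetical, and the summary does not specify the exact formulation used in LISA.

```python
import numpy as np

def orthogonality_penalty(gaze_feats, appearance_feats, eps=1e-8):
    """Mean squared cosine similarity between paired feature rows.

    Hypothetical helper: both inputs have shape (batch, dim). The value is
    close to 0 when each gaze vector is orthogonal to its paired appearance
    vector and close to 1 when the two are parallel.
    """
    # Normalize each row to unit length (eps guards against zero vectors).
    g = gaze_feats / (np.linalg.norm(gaze_feats, axis=1, keepdims=True) + eps)
    a = appearance_feats / (np.linalg.norm(appearance_feats, axis=1, keepdims=True) + eps)
    # Row-wise cosine similarity, squared so that the sign does not matter.
    cos = np.sum(g * a, axis=1)
    return float(np.mean(cos ** 2))

# Toy inputs: perpendicular pairs versus identical pairs.
gaze = np.array([[1.0, 0.0], [0.0, 2.0]])
appearance = np.array([[0.0, 3.0], [5.0, 0.0]])
penalty_orthogonal = orthogonality_penalty(gaze, appearance)
penalty_parallel = orthogonality_penalty(gaze, gaze)
```

In a training setup, a term of this kind would typically be added to the main gaze regression loss with a small weight, so that minimizing it pushes the two feature groups toward carrying non-overlapping information.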

📝 Abstract

Driver gaze estimation serves as a fundamental metric for evaluating driver attentiveness in modern monitoring systems. Beyond being vulnerable to sudden lighting changes and sensor noise, spatial-domain models struggle to disentangle authentic gaze cues from irrelevant visual attributes. In this paper, we propose LISA, a Language-guided Interference-aware Spatial-Frequency Attention framework that combines frequency-domain priors with vision-language knowledge. Observing that the amplitude spectrum remains relatively stable even under spatial perturbations, we design a dual-domain fusion mechanism. It integrates stable low-frequency semantics into high-frequency details, employing spatial attention to precisely target ocular regions. To reduce semantic ambiguity, we also introduce a training-time disentanglement strategy. Using a frozen CLIP encoder and orthogonal regularization, we explicitly separate gaze features from appearance interference. Experiments on two benchmarks show that LISA achieves state-of-the-art performance, with significantly improved robustness against occlusions and lighting variations. The code repository is available at https://github.com/Mason-bupt/LISA.
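
The observation about amplitude-spectrum stability can be illustrated with a standard property of the 2D discrete Fourier transform: a circular spatial shift changes only the phase, not the amplitude. The sketch below demonstrates this on a random image and also extracts a low-frequency component with a circular mask. It is a minimal NumPy sketch, not the paper's fusion module: the helper names (`amplitude_phase`, `low_frequency_mask`, `low_pass`), the mask shape, and the radius value are assumptions made for illustration.

```python
import numpy as np

def amplitude_phase(image):
    """Split a 2D image into amplitude and phase spectra via the FFT."""
    spectrum = np.fft.fft2(image)
    return np.abs(spectrum), np.angle(spectrum)

def low_frequency_mask(shape, radius):
    """Boolean mask keeping frequencies within `radius` of the DC term."""
    h, w = shape
    fy = np.fft.fftfreq(h) * h  # signed integer frequency indices (rows)
    fx = np.fft.fftfreq(w) * w  # signed integer frequency indices (cols)
    dist = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    return dist <= radius

def low_pass(image, radius):
    """Keep only the low-frequency content of a 2D image."""
    spectrum = np.fft.fft2(image)
    mask = low_frequency_mask(image.shape, radius)
    # The mask is symmetric, so the inverse transform is real up to rounding.
    return np.real(np.fft.ifft2(spectrum * mask))

rng = np.random.default_rng(0)
image = rng.random((32, 32))

# A circular shift is a simple spatial perturbation: only the phase changes.
shifted = np.roll(image, shift=(3, -5), axis=(0, 1))
amp, _ = amplitude_phase(image)
amp_shifted, _ = amplitude_phase(shifted)
max_amp_diff = float(np.max(np.abs(amp - amp_shifted)))

# Low-frequency component that a fusion step could combine with fine detail.
smooth = low_pass(image, radius=4)
```

Here `max_amp_diff` is zero up to floating-point rounding, which is the exact version of the stability the abstract describes. Real perturbations such as lighting changes or sensor noise are not pure shifts, so the amplitude spectrum would be only approximately preserved in practice.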

Problem

Research questions and friction points this paper is trying to address.

driver gaze estimation

spatial interference

lighting variations

semantic ambiguity

visual attributes

Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial-frequency attention

vision-language guidance

frequency-domain priors