FSDAM: Few-Shot Driving Attention Modeling via Vision-Language Coupling

📅 2025-11-16
📈 Citations: 0
Influential citations: 0
📄 PDF
🤖 AI Summary
This paper addresses the challenge of understanding and interpreting driver attention in driving scenarios. The authors propose a few-shot joint modeling approach that requires only approximately 100 annotated samples. The method employs a vision–language dual-path coupled network to perform spatial attention prediction and natural language caption generation simultaneously, with cross-modal contrastive learning enforcing semantic alignment between the two modalities. Key contributions include: (1) a lightweight dual-path architecture that jointly optimizes attention localization and interpretable text generation under extreme label scarcity; (2) strong zero-shot generalization, with attention prediction performance on par with fully supervised methods across multiple driving benchmarks and generation of semantically coherent, context-aware captions; and (3) markedly improved interpretability of human intent and human-factor modeling capability for autonomous driving systems.
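
The listing does not include reference code, so the following PyTorch sketch only illustrates one way the described dual-path coupling could be structured: a shared visual encoder feeds one head that predicts a spatial attention map and a second head that decodes a caption over the same visual tokens. All class, module, and parameter names (FSDAMSketch, vision_backbone, attn_head, caption_decoder, the feature sizes) are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class FSDAMSketch(nn.Module):
    """Illustrative dual-path model: a shared visual encoder feeds
    (a) a spatial attention-map head and (b) a caption-decoding head.
    Layer choices and sizes are assumptions, not the paper's architecture."""

    def __init__(self, feat_dim=768, vocab_size=30522, num_decoder_layers=2):
        super().__init__()
        # Shared vision backbone (placeholder patchifier; any ViT/CNN features would do)
        self.vision_backbone = nn.Conv2d(3, feat_dim, kernel_size=16, stride=16)
        # Path 1: spatial attention prediction (per-patch saliency)
        self.attn_head = nn.Conv2d(feat_dim, 1, kernel_size=1)
        # Path 2: caption generation (small autoregressive decoder over visual tokens)
        self.token_embed = nn.Embedding(vocab_size, feat_dim)
        layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.caption_decoder = nn.TransformerDecoder(layer, num_layers=num_decoder_layers)
        self.lm_head = nn.Linear(feat_dim, vocab_size)

    def forward(self, images, caption_tokens):
        feats = self.vision_backbone(images)               # (B, D, H', W')
        attn_map = torch.sigmoid(self.attn_head(feats))    # (B, 1, H', W') saliency map
        visual_tokens = feats.flatten(2).transpose(1, 2)   # (B, N, D) patch tokens
        tgt = self.token_embed(caption_tokens)             # (B, T, D)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        dec = self.caption_decoder(tgt, visual_tokens, tgt_mask=causal)  # cross-attend to vision
        caption_logits = self.lm_head(dec)                 # (B, T, vocab)
        return attn_map, caption_logits, visual_tokens

Keeping both heads on one backbone is what allows a contrastive alignment term (sketched after the abstract below) to tie the predicted attention to the generated text.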

📝 Abstract
Understanding where drivers look and why they shift their attention is essential for autonomous systems that read human intent and justify their actions. Most existing models rely on large-scale gaze datasets to learn these patterns; however, such datasets are labor-intensive to collect and time-consuming to curate. We present FSDAM (Few-Shot Driver Attention Modeling), a framework that achieves joint attention prediction and caption generation with approximately 100 annotated examples, two orders of magnitude fewer than existing approaches. Our approach introduces a dual-pathway architecture where separate modules handle spatial prediction and caption generation while maintaining semantic consistency through cross-modal alignment. Despite minimal supervision, FSDAM achieves competitive attention prediction performance and generates coherent, context-aware explanations. The model demonstrates robust zero-shot generalization across multiple driving benchmarks. This work shows that effective attention-conditioned generation is achievable with limited supervision, opening new possibilities for practical deployment of explainable driver attention systems in data-constrained scenarios.
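
The abstract states that semantic consistency is maintained through cross-modal alignment but does not give the objective. A common choice for this kind of alignment is a symmetric InfoNCE-style contrastive loss between attention-pooled visual features and caption embeddings; the sketch below assumes that formulation, and the function name, pooling scheme, and temperature are illustrative, so the paper's actual loss may differ.

import torch
import torch.nn.functional as F

def cross_modal_alignment_loss(visual_tokens, attn_weights, text_feats, temperature=0.07):
    """Symmetric InfoNCE-style alignment loss (an assumed formulation, not the paper's).

    visual_tokens: (B, N, D) patch features from the vision path
    attn_weights:  (B, N)    predicted per-patch saliency, used for pooling
    text_feats:    (B, D)    embedding of the reference or generated caption
    """
    # Attention-weighted pooling: emphasize regions the model predicts the driver attends to
    w = attn_weights / (attn_weights.sum(dim=1, keepdim=True) + 1e-6)
    pooled = torch.einsum("bn,bnd->bd", w, visual_tokens)

    # Cosine-similarity logits over the batch; matching image-caption pairs sit on the diagonal
    v = F.normalize(pooled, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.t() / temperature

    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
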
Problem

Research questions and friction points this paper is trying to address.

Models driver attention with minimal annotated data
Generates context-aware captions for attention shifts
Enables zero-shot generalization across driving benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Few-shot learning with approximately 100 annotated examples (a minimal training-step sketch follows this list)
Dual-pathway architecture for spatial-caption alignment
Zero-shot generalization across driving benchmarks
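
To make joint optimization under label scarcity concrete, here is one possible training step that combines a saliency loss, a caption loss, and the alignment loss on a small annotated batch. It reuses the hypothetical FSDAMSketch model and cross_modal_alignment_loss from the earlier sketches; the loss weights, the BCE saliency term, and the assumption of precomputed caption embeddings (text_feats) are illustrative choices, not details from the paper.

import torch
import torch.nn.functional as F

def joint_training_step(model, batch, optimizer, w_sal=1.0, w_cap=1.0, w_align=0.5):
    """One illustrative joint update on a small annotated batch (assumed losses and weights)."""
    images, gaze_maps, caption_tokens, text_feats = batch

    # Teacher forcing: feed tokens [0..T-2], predict tokens [1..T-1]
    attn_map, caption_logits, visual_tokens = model(images, caption_tokens[:, :-1])

    # Saliency loss: match predicted attention to gaze maps resized to the patch grid
    pred = attn_map.flatten(1)                                            # (B, N)
    target = F.adaptive_avg_pool2d(gaze_maps, attn_map.shape[-2:]).flatten(1)
    sal_loss = F.binary_cross_entropy(pred, target.clamp(0, 1))

    # Caption loss: next-token cross-entropy against the reference caption
    cap_loss = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_tokens[:, 1:].reshape(-1),
    )

    # Cross-modal alignment loss from the earlier sketch
    align_loss = cross_modal_alignment_loss(visual_tokens, pred, text_feats)

    loss = w_sal * sal_loss + w_cap * cap_loss + w_align * align_loss
    optimizer.zero_grad()
    loss.backward()
    loss_value = loss.item()
    optimizer.step()
    return loss_value
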
Authors

Kaiser Hamid
Texas Tech University, Lubbock, TX

Can Cui
Purdue University, West Lafayette, IN

Khandakar Ashrafi Akbar
University of Texas at Dallas, Richardson, TX

Ziran Wang
Purdue University
Autonomous Driving, Digital Twin, Human-Centered AI, Human-Autonomy Teaming, Intelligent Vehicles

Nade Liang
Assistant Professor at Texas Tech University
Human Factors, Autonomous Driving, Human Performance, Cognitive Workload