🤖 AI Summary
This paper addresses the challenge of understanding and interpreting driver attention in driving scenarios. The authors propose a few-shot joint modeling approach that requires only about 100 annotated samples. Their method employs a vision–language dual-path coupled network that simultaneously performs spatial attention prediction and natural language caption generation, with cross-modal contrastive learning enforcing semantic alignment between the two modalities. Key contributions include: (1) a lightweight dual-path architecture that jointly optimizes attention localization and interpretable text generation under extreme label scarcity; (2) strong zero-shot generalization, achieving attention prediction performance on par with fully supervised methods across multiple driving benchmarks while generating semantically coherent, context-aware captions; and (3) significantly improved interpretability of human intent and stronger human-factor modeling for autonomous driving systems.
📝 Abstract
Understanding where drivers look and why they shift their attention is essential for autonomous systems that read human intent and justify their actions. Most existing models rely on large-scale gaze datasets to learn these patterns; however, such datasets are labor-intensive to collect and time-consuming to curate. We present FSDAM (Few-Shot Driver Attention Modeling), a framework that achieves joint attention prediction and caption generation with approximately 100 annotated examples, two orders of magnitude fewer than existing approaches. Our approach introduces a dual-pathway architecture in which separate modules handle spatial prediction and caption generation while maintaining semantic consistency through cross-modal alignment. Despite minimal supervision, FSDAM achieves competitive attention prediction performance and generates coherent, context-aware explanations. The model also demonstrates robust zero-shot generalization across multiple driving benchmarks. This work shows that effective attention-conditioned generation is achievable with limited supervision, opening new possibilities for practical deployment of explainable driver attention systems in data-constrained scenarios.
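The cross-modal alignment between the two pathways is typically realized as a symmetric contrastive (InfoNCE-style) objective that pulls each attention-map embedding toward its paired caption embedding and pushes it away from mismatched captions in the batch. The sketch below illustrates this idea in plain NumPy; the function name, the temperature value, and the use of cosine similarity are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def info_nce_alignment(vis_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over paired embeddings.

    vis_emb, txt_emb: (batch, dim) arrays; row i of each is a matched
    attention/caption pair. Temperature 0.07 is a common default, not
    a value taken from the paper.
    """
    # L2-normalize so dot products are cosine similarities
    v = vis_emb / np.linalg.norm(vis_emb, axis=1, keepdims=True)
    t = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature        # (batch, batch) similarity matrix
    labels = np.arange(len(logits))       # diagonal entries are positives

    def xent(lg):
        # cross-entropy of the diagonal (positive) entries per row
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average vision->text and text->vision directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

A correctly aligned batch (matched pairs on the diagonal) yields a lower loss than the same embeddings with captions shuffled, which is the signal that enforces semantic consistency between the spatial and textual pathways.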