🤖 AI Summary
This paper addresses the challenge of understanding and interpreting driver attention in driving scenarios. The authors propose a few-shot joint modeling approach that requires only about 100 annotated samples. Their method employs a vision–language dual-path coupled network that simultaneously performs spatial attention prediction and natural language caption generation, with cross-modal contrastive learning enforcing semantic alignment between the two modalities. Key contributions include: (1) a lightweight dual-path architecture that jointly optimizes attention localization and interpretable text generation under extreme label scarcity; (2) strong zero-shot generalization, achieving attention prediction performance on par with fully supervised methods across multiple driving benchmarks while generating semantically coherent, context-aware captions; and (3) significantly improved interpretability of human intent and stronger human-factor modeling for autonomous driving systems.
📝 Abstract
Understanding where drivers look and why they shift their attention is essential for autonomous systems that read human intent and justify their actions. Most existing models rely on large-scale gaze datasets to learn these patterns; however, such datasets are labor-intensive to collect and time-consuming to curate. We present FSDAM (Few-Shot Driver Attention Modeling), a framework that achieves joint attention prediction and caption generation with approximately 100 annotated examples, two orders of magnitude fewer than existing approaches. Our approach introduces a dual-pathway architecture in which separate modules handle spatial prediction and caption generation while maintaining semantic consistency through cross-modal alignment. Despite minimal supervision, FSDAM achieves competitive attention prediction performance and generates coherent, context-aware explanations. The model also demonstrates robust zero-shot generalization across multiple driving benchmarks. This work shows that effective attention-conditioned generation is achievable with limited supervision, opening new possibilities for practical deployment of explainable driver attention systems in data-constrained scenarios.
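The cross-modal alignment between the two pathways is typically realized as a symmetric contrastive (InfoNCE-style) objective that pulls each attention-map embedding toward its paired caption embedding and pushes it away from mismatched captions in the batch. The sketch below illustrates this idea in plain NumPy; the function name, the temperature value, and the use of cosine similarity are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def info_nce_alignment(vis_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over paired embeddings.

    vis_emb, txt_emb: (batch, dim) arrays; row i of each is a matched
    attention/caption pair. Temperature 0.07 is a common default, not
    a value taken from the paper.
    """
    # L2-normalize so dot products are cosine similarities
    v = vis_emb / np.linalg.norm(vis_emb, axis=1, keepdims=True)
    t = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature        # (batch, batch) similarity matrix
    labels = np.arange(len(logits))       # diagonal entries are positives

    def xent(lg):
        # cross-entropy of the diagonal (positive) entries per row
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average vision->text and text->vision directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

A correctly aligned batch (matched pairs on the diagonal) yields a lower loss than the same embeddings with captions shuffled, which is the signal that enforces semantic consistency between the spatial and textual pathways.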