🤖 AI Summary
Existing driver visual attention prediction methods rely predominantly on single-frame, static-image modeling and fail to capture how gaze patterns evolve over time in dynamic driving scenarios. To address this, we propose what is, to our knowledge, the first vision-language attention-transition modeling framework tailored to driving: built on the LLaVA architecture, it jointly encodes RGB video frames and human-annotated textual descriptions (covering road semantics, risk anticipation, and other high-level intentions) through a multimodal alignment mechanism. We further introduce a contextualized attention module that generates interpretable attention descriptions under few-shot or zero-shot settings. Our method significantly outperforms general-purpose vision-language models (VLMs) in both attention transition detection and semantic consistency. In addition, we design driving-specific evaluation metrics, semantic alignment and response diversity, to rigorously assess generalization capability and domain adaptability.
📝 Abstract
Driver visual attention prediction is a critical task in autonomous driving and human-computer interaction (HCI) research. Most prior studies estimate attention allocation at a single moment in time, typically from static RGB images such as driving scene pictures. In this work, we propose a vision-language framework that models the temporal evolution of drivers' gaze in natural language, using few-shot and zero-shot learning on single RGB images. We curate and refine high-quality captions from the BDD-A dataset using human-in-the-loop feedback, then fine-tune LLaVA to align visual perception with attention-centric scene understanding. Our approach integrates both low-level visual cues and top-down context (e.g., route semantics, risk anticipation), enabling language-based descriptions of gaze behavior. We evaluate performance across training regimes (few-shot and one-shot) and introduce domain-specific metrics for semantic alignment and response diversity. Results show that our fine-tuned model outperforms general-purpose VLMs in attention shift detection and interpretability. To our knowledge, this is among the first attempts to describe and predict driver visual attention allocation and shifts in natural language, offering a new direction for explainable AI in autonomous driving. Our approach provides a foundation for downstream tasks such as behavior forecasting, human-AI teaming, and multi-agent coordination.
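The abstract does not specify how the semantic-alignment and response-diversity metrics are computed. As a rough illustration only, here is a minimal sketch of plausible stand-ins: token-overlap (Jaccard) alignment between a generated attention description and a reference caption, and distinct-n diversity over a set of generated responses. Both are hypothetical simplifications; the paper's actual metrics may differ substantially.

```python
# Hypothetical simplifications of the driving-specific metrics named in the
# abstract. Semantic alignment is approximated by Jaccard word-set overlap;
# response diversity by the distinct-n ratio (unique n-grams / total n-grams).

def semantic_alignment(generated: str, reference: str) -> float:
    """Jaccard overlap of lowercased word sets (a crude alignment proxy)."""
    g, r = set(generated.lower().split()), set(reference.lower().split())
    return len(g & r) / len(g | r) if g | r else 0.0

def response_diversity(responses: list[str], n: int = 2) -> float:
    """Distinct-n: unique n-grams divided by total n-grams across responses."""
    ngrams = []
    for resp in responses:
        toks = resp.lower().split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```

Identical generated and reference descriptions score an alignment of 1.0, and a batch of verbatim-repeated responses scores low diversity, matching the intuition that a useful model should be both faithful to the scene and varied across scenes.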