VISTA: Vision-Language Imitation of Situational Thinking and Attention for Human-Like Driver Focus in Dynamic Environments

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing driver visual attention prediction methods predominantly rely on static-image, single-frame modeling, failing to capture the temporal evolution of gaze patterns in dynamic driving scenarios. To address this, we propose the first vision-language attention transition modeling framework tailored for driving: built upon the LLaVA architecture, it jointly encodes RGB video frames and human-annotated textual descriptions—encompassing road semantics, risk anticipation, and other high-level intentions—via a multimodal alignment mechanism. We further introduce a contextualized attention module enabling interpretable attention description generation under few-shot or zero-shot settings. Our method significantly outperforms general-purpose vision-language models (VLMs) in both attention transition detection and semantic consistency. Additionally, we design driving-specific evaluation metrics—semantic alignment degree and response diversity—to rigorously assess generalization capability and domain adaptability.
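The summary above describes a LLaVA-based pipeline that turns a driving frame plus textual context into a natural-language attention description. As a rough illustration of that interface only, the sketch below queries an off-the-shelf LLaVA checkpoint (`llava-hf/llava-1.5-7b-hf` on Hugging Face) with a hypothetical attention-centric prompt in a zero-shot setting; it is not the authors' fine-tuned model or their contextualized attention module.

```python
# Minimal zero-shot sketch: ask a LLaVA-style VLM for a driver-attention
# description from a single RGB frame. Checkpoint, file name, and prompt are
# assumptions for illustration, not the paper's pipeline.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

frame = Image.open("driving_frame.jpg")  # hypothetical BDD-A-style frame
prompt = (
    "USER: <image>\n"
    "Describe where a human driver would focus attention in this scene, "
    "and how that focus is likely to shift in the next few seconds "
    "(consider route semantics and potential hazards). ASSISTANT:"
)

inputs = processor(images=frame, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```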

📝 Abstract
Driver visual attention prediction is a critical task in autonomous driving and human-computer interaction (HCI) research. Most prior studies focus on estimating attention allocation at a single moment in time, typically using static RGB images such as driving scene pictures. In this work, we propose a vision-language framework that models how drivers' gaze evolves over time through natural language, using few-shot and zero-shot learning on single RGB images. We curate and refine high-quality captions from the BDD-A dataset using human-in-the-loop feedback, then fine-tune LLaVA to align visual perception with attention-centric scene understanding. Our approach integrates both low-level cues and top-down context (e.g., route semantics, risk anticipation), enabling language-based descriptions of gaze behavior. We evaluate performance across training regimes (few-shot and one-shot) and introduce domain-specific metrics for semantic alignment and response diversity. Results show that our fine-tuned model outperforms general-purpose VLMs in attention shift detection and interpretability. To our knowledge, this is among the first attempts to generate natural-language predictions of driver visual attention allocation and shifts, offering a new direction for explainable AI in autonomous driving. Our approach provides a foundation for downstream tasks such as behavior forecasting, human-AI teaming, and multi-agent coordination.
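The abstract refers to domain-specific metrics for semantic alignment and response diversity without giving their formulas here. A common way to approximate such metrics is sentence-embedding cosine similarity for alignment and a distinct-n ratio for diversity; the sketch below follows that assumption (using the `sentence-transformers` library with an off-the-shelf `all-MiniLM-L6-v2` encoder) and should not be read as the paper's actual definitions.

```python
# Assumed stand-ins for the paper's metrics: embedding cosine similarity for
# semantic alignment, distinct-2 for response diversity. The authors' own
# definitions may differ.
from itertools import chain
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_alignment(generated: list[str], references: list[str]) -> float:
    """Mean cosine similarity between generated and reference descriptions."""
    gen_emb = embedder.encode(generated, convert_to_tensor=True)
    ref_emb = embedder.encode(references, convert_to_tensor=True)
    return util.cos_sim(gen_emb, ref_emb).diagonal().mean().item()

def distinct_n(responses: list[str], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams across all responses."""
    ngrams = list(chain.from_iterable(
        zip(*(r.split()[i:] for i in range(n))) for r in responses
    ))
    return len(set(ngrams)) / max(len(ngrams), 1)

# Toy usage with made-up strings.
preds = ["Gaze shifts from the lead vehicle to the pedestrian on the right."]
refs = ["The driver looks at the car ahead, then at a pedestrian crossing."]
print(semantic_alignment(preds, refs), distinct_n(preds, n=2))
```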
Problem

Research questions and friction points this paper is trying to address.

Predicting dynamic driver visual attention shifts in driving scenes
Modeling gaze behavior via a vision-language framework using few-shot learning
Enhancing interpretability of attention allocation through natural language descriptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-language framework for dynamic gaze prediction
Few-shot and zero-shot learning on RGB images
Fine-tuned LLaVA for attention-centric scene understanding
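One plausible way to realize the "Fine-tuned LLaVA" item above is parameter-efficient LoRA adaptation of the language model while the vision tower stays frozen. The sketch below uses Hugging Face `transformers` and `peft`; the LoRA rank, target modules, prompt format, example file, and single-step loop are assumptions for illustration, since the listing does not report the authors' training configuration.

```python
# One plausible LoRA fine-tuning setup for attention-centric captions;
# hyperparameters and data handling are assumptions, not the paper's setup.
import torch
from PIL import Image
from peft import LoraConfig, get_peft_model
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# LoRA adapters on the language model's attention projections only.
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)
model.print_trainable_parameters()

# Single illustrative step on one (frame, attention caption) pair.
frame = Image.open("bdda_frame.jpg")  # hypothetical curated BDD-A sample
text = (
    "USER: <image>\nWhere does the driver's attention go next? ASSISTANT: "
    "Attention shifts from the lead vehicle to the cyclist merging from the right."
)
batch = processor(images=frame, text=text, return_tensors="pt")
# A real pipeline would mask prompt and image positions in the labels with -100.
labels = batch["input_ids"].clone()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```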
Kaiser Hamid
Edward E. Whitacre Jr. College of Engineering, Texas Tech University
Khandakar Ashrafi Akbar
Erik Jonsson School of Engineering and Computer Science, The University of Texas at Dallas
Nade Liang
Assistant Professor at Texas Tech University
Human Factors · Autonomous Driving · Human Performance · Cognitive Workload