DiffAttn: Diffusion-Based Drivers' Visual Attention Prediction with LLM-Enhanced Semantic Reasoning

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a conditional diffusion-based approach for predicting driver visual attention to enhance intelligent vehicles’ perception and decision-making regarding potential risks. It introduces diffusion models to this task for the first time, integrating a Swin Transformer encoder with a multi-scale feature fusion pyramid decoder, and leverages a large language model (LLM) to strengthen top-down, safety-critical semantic reasoning. The framework jointly models local visual details and global scene context through a novel multi-scale conditional diffusion mechanism, significantly improving fine-grained attention prediction. Evaluated on four public datasets, the method outperforms existing baselines—including those based on video inputs, top-down cues, and LLM-enhanced approaches—while enabling interpretable, driver-centric scene understanding.
📝 Abstract
Drivers' visual attention provides critical cues for anticipating latent hazards and directly shapes decision-making and control maneuvers, and its absence can compromise traffic safety. To emulate drivers' perception patterns and advance visual attention prediction for intelligent vehicles, we propose DiffAttn, a diffusion-based framework that formulates this task as a conditional diffusion-denoising process, enabling more accurate modeling of drivers' attention. To capture both local and global scene features, we adopt a Swin Transformer as the encoder and design a decoder that combines a Feature Fusion Pyramid for cross-layer interaction with dense, multi-scale conditional diffusion, jointly enhancing denoising learning and modeling fine-grained local and global scene contexts. Additionally, a large language model (LLM) layer is incorporated to enhance top-down semantic reasoning and improve sensitivity to safety-critical cues. Extensive experiments on four public datasets demonstrate that DiffAttn achieves state-of-the-art (SoTA) performance, surpassing most video-based, top-down-feature-driven, and LLM-enhanced baselines. Our framework further supports interpretable driver-centric scene understanding and has the potential to improve in-cabin human-machine interaction, risk perception, and driver-state measurement in intelligent vehicles.
Problem

Research questions and friction points this paper is trying to address.

visual attention prediction
driver perception
traffic safety
hazard anticipation
intelligent vehicles
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion-based attention prediction
Swin Transformer
Feature Fusion Pyramid
LLM-enhanced semantic reasoning
multi-scale conditional diffusion
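The conditional diffusion-denoising formulation named above can be sketched minimally. This is a hypothetical simplification for illustration only: it uses a standard DDPM noise schedule on a toy attention map, whereas DiffAttn's actual noise predictor is a Swin Transformer encoder with a Feature Fusion Pyramid decoder and LLM-enhanced conditioning.

```python
import numpy as np

def make_noise_schedule(T=100, beta_start=1e-4, beta_end=0.02):
    # Linear variance schedule, as in standard DDPMs.
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Forward process: corrupt the clean attention map x0 at step t."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return x_t, noise

def ddpm_denoise_step(x_t, t, eps_pred, betas, alphas, alpha_bars, rng):
    """One reverse step: subtract the (conditionally) predicted noise.
    In DiffAttn, eps_pred would come from the scene-conditioned network."""
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (x_t - coef * eps_pred) / np.sqrt(alphas[t])
    if t > 0:
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
    return mean

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_noise_schedule()
x0 = rng.random((16, 16))  # toy "attention map" stand-in
x_t, noise = q_sample(x0, 50, alpha_bars, rng)
# Here we pass the true noise as a stand-in for the network's prediction.
x_prev = ddpm_denoise_step(x_t, 50, noise, betas, alphas, alpha_bars, rng)
```

The multi-scale aspect of the paper's decoder would, in this framing, apply such denoising at several feature resolutions rather than on a single map.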
Weimin Liu
Assistant Professor, School of Physical Science and Technology, ShanghaiTech University
ultrafast spectroscopy
Qingkun Li
Beijing Key Laboratory of Human-Computer Interaction, Institute of Software, Chinese Academy of Sciences, Beijing, China
Jiyuan Qiu
Remote Sensing and Earth Observation Laboratory, University of Copenhagen, Copenhagen K, Denmark
Wenjun Wang
Tianjin University
Data Mining · Social Network · Complex Network · Smart City
Joshua H. Meng
California PATH, University of California, Berkeley, CA, USA