🤖 AI Summary
This work proposes a conditional diffusion-based approach for predicting driver visual attention, aiming to enhance intelligent vehicles' perception of and decision-making about potential risks. It introduces diffusion models to this task for the first time, pairing a Swin Transformer encoder with a multi-scale feature fusion pyramid decoder, and leverages a large language model (LLM) to strengthen top-down, safety-critical semantic reasoning. The framework jointly models local visual details and global scene context through a novel multi-scale conditional diffusion mechanism, substantially improving fine-grained attention prediction. Evaluated on four public datasets, the method surpasses most existing baselines, including video-based, top-down-cue-driven, and LLM-enhanced approaches, while enabling interpretable, driver-centric scene understanding.
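The exact parameterization is not given in this summary; conditional diffusion approaches of this kind typically instantiate a standard DDPM-style formulation, sketched below under that assumption, where $y_0$ is the ground-truth attention map and $c$ denotes the encoded scene features acting as the condition:

$$
q(y_t \mid y_0) = \mathcal{N}\!\left(y_t;\ \sqrt{\bar\alpha_t}\,y_0,\ (1-\bar\alpha_t)\mathbf{I}\right), \qquad
p_\theta(y_{t-1} \mid y_t, c) = \mathcal{N}\!\left(y_{t-1};\ \mu_\theta(y_t, t, c),\ \sigma_t^2\mathbf{I}\right)
$$

Training then minimizes the usual noise-prediction objective $\mathbb{E}\,\|\epsilon - \epsilon_\theta(y_t, t, c)\|^2$, and at inference the attention map is recovered by iteratively denoising from Gaussian noise while conditioning on the scene.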
📝 Abstract
Drivers' visual attention provides critical cues for anticipating latent hazards and directly shapes decision-making and control maneuvers; lapses in attention can therefore compromise traffic safety. To emulate drivers' perception patterns and advance visual attention prediction for intelligent vehicles, we propose DiffAttn, a framework that formulates the task as a conditional diffusion-denoising process, enabling more accurate modeling of drivers' attention. To capture both local and global scene features, we adopt a Swin Transformer as the encoder and design a decoder that combines a Feature Fusion Pyramid for cross-layer interaction with dense, multi-scale conditional diffusion, jointly strengthening denoising learning and the modeling of fine-grained local and global scene context. Additionally, a large language model (LLM) layer is incorporated to enhance top-down semantic reasoning and improve sensitivity to safety-critical cues. Extensive experiments on four public datasets demonstrate that DiffAttn achieves state-of-the-art (SoTA) performance, surpassing most video-based, top-down-feature-driven, and LLM-enhanced baselines. Our framework further supports interpretable, driver-centric scene understanding and has the potential to improve in-cabin human-machine interaction, risk perception, and driver state assessment in intelligent vehicles.
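To make the conditioning pipeline concrete, here is a minimal, hypothetical PyTorch sketch of the pattern the abstract describes: multi-scale encoder features are fused top-down into a single conditioning map that guides a noise-prediction denoiser. All module names, shapes, and the toy CNN standing in for the Swin Transformer encoder are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch; none of these names come from the paper.
# A tiny CNN stands in for the Swin Transformer encoder; the point is the
# conditioning pathway: multi-scale features -> fused map -> conditional denoiser.

class ToyEncoder(nn.Module):
    """Stand-in for the Swin encoder: returns features at three scales."""
    def __init__(self, ch=32):
        super().__init__()
        self.s1 = nn.Sequential(nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.GELU())
        self.s2 = nn.Sequential(nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.GELU())
        self.s3 = nn.Sequential(nn.Conv2d(ch * 2, ch * 4, 3, stride=2, padding=1), nn.GELU())

    def forward(self, x):
        f1 = self.s1(x)
        f2 = self.s2(f1)
        f3 = self.s3(f2)
        return [f1, f2, f3]

class PyramidFusion(nn.Module):
    """FPN-style top-down fusion of multi-scale features into one conditioning map."""
    def __init__(self, chs=(32, 64, 128), out_ch=32):
        super().__init__()
        self.lat = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in chs])

    def forward(self, feats):
        fused = self.lat[-1](feats[-1])
        for f, lat in zip(reversed(feats[:-1]), reversed(self.lat[:-1])):
            fused = lat(f) + F.interpolate(
                fused, size=f.shape[-2:], mode="bilinear", align_corners=False)
        return fused  # highest-resolution fused map

class ConditionalDenoiser(nn.Module):
    """Predicts the noise in a noisy attention map, given fused scene features."""
    def __init__(self, cond_ch=32, hidden=64, T=1000):
        super().__init__()
        self.t_embed = nn.Embedding(T, hidden)
        self.net = nn.Sequential(
            nn.Conv2d(1 + cond_ch + hidden, hidden, 3, padding=1), nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.GELU(),
            nn.Conv2d(hidden, 1, 3, padding=1),
        )

    def forward(self, y_t, cond, t):
        temb = self.t_embed(t)[:, :, None, None].expand(-1, -1, *y_t.shape[-2:])
        return self.net(torch.cat([y_t, cond, temb], dim=1))

# One DDPM-style training step with the standard epsilon-prediction objective.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

enc, fuse, denoiser = ToyEncoder(), PyramidFusion(), ConditionalDenoiser()

img = torch.randn(2, 3, 64, 64)   # driving-scene frame (toy resolution)
y0 = torch.rand(2, 1, 32, 32)     # ground-truth attention map at fused-feature resolution
cond = fuse(enc(img))             # multi-scale conditioning signal

t = torch.randint(0, T, (2,))
eps = torch.randn_like(y0)
ab = alpha_bar[t].view(-1, 1, 1, 1)
y_t = ab.sqrt() * y0 + (1 - ab).sqrt() * eps   # forward noising q(y_t | y_0)

loss = F.mse_loss(denoiser(y_t, cond, t), eps) # learn to reverse the noising
loss.backward()
```

The paper's dense, multi-scale conditional diffusion presumably injects conditioning at several decoder resolutions rather than the single fused map used here, and the LLM layer would contribute an additional top-down semantic signal; both are omitted from this sketch for brevity.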