🤖 AI Summary
This work proposes a conditional diffusion-based approach for predicting driver visual attention, aiming to enhance intelligent vehicles' perception of and decision-making about potential risks. It introduces diffusion models to this task for the first time, pairing a Swin Transformer encoder with a multi-scale feature fusion pyramid decoder, and leverages a large language model (LLM) to strengthen top-down, safety-critical semantic reasoning. The framework jointly models local visual details and global scene context through a novel multi-scale conditional diffusion mechanism, substantially improving fine-grained attention prediction. Evaluated on four public datasets, the method surpasses most existing baselines, including video-based, top-down-cue-driven, and LLM-enhanced approaches, while enabling interpretable, driver-centric scene understanding.
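The exact parameterization is not given in this summary; conditional diffusion approaches of this kind typically instantiate a standard DDPM-style formulation, sketched below under that assumption, where $y_0$ is the ground-truth attention map and $c$ denotes the encoded scene features acting as the condition:

$$
q(y_t \mid y_0) = \mathcal{N}\!\left(y_t;\ \sqrt{\bar\alpha_t}\,y_0,\ (1-\bar\alpha_t)\mathbf{I}\right), \qquad
p_\theta(y_{t-1} \mid y_t, c) = \mathcal{N}\!\left(y_{t-1};\ \mu_\theta(y_t, t, c),\ \sigma_t^2\mathbf{I}\right)
$$

Training then minimizes the usual noise-prediction objective $\mathbb{E}\,\|\epsilon - \epsilon_\theta(y_t, t, c)\|^2$, and at inference the attention map is recovered by iteratively denoising from Gaussian noise while conditioning on the scene.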
📝 Abstract
Drivers' visual attention provides critical cues for anticipating latent hazards and directly shapes decision-making and control maneuvers; lapses in attention can therefore compromise traffic safety. To emulate drivers' perception patterns and advance visual attention prediction for intelligent vehicles, we propose DiffAttn, a framework that formulates the task as a conditional diffusion-denoising process, enabling more accurate modeling of drivers' attention. To capture both local and global scene features, we adopt a Swin Transformer as the encoder and design a decoder that combines a Feature Fusion Pyramid for cross-layer interaction with dense, multi-scale conditional diffusion, jointly strengthening denoising learning and the modeling of fine-grained local and global scene context. Additionally, a large language model (LLM) layer is incorporated to enhance top-down semantic reasoning and improve sensitivity to safety-critical cues. Extensive experiments on four public datasets demonstrate that DiffAttn achieves state-of-the-art (SoTA) performance, surpassing most video-based, top-down-feature-driven, and LLM-enhanced baselines. Our framework further supports interpretable, driver-centric scene understanding and has the potential to improve in-cabin human-machine interaction, risk perception, and driver state assessment in intelligent vehicles.
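To make the conditioning pipeline concrete, here is a minimal, hypothetical PyTorch sketch of the pattern the abstract describes: multi-scale encoder features are fused top-down into a single conditioning map that guides a noise-prediction denoiser. All module names, shapes, and the toy CNN standing in for the Swin Transformer encoder are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch; none of these names come from the paper.
# A tiny CNN stands in for the Swin Transformer encoder; the point is the
# conditioning pathway: multi-scale features -> fused map -> conditional denoiser.

class ToyEncoder(nn.Module):
    """Stand-in for the Swin encoder: returns features at three scales."""
    def __init__(self, ch=32):
        super().__init__()
        self.s1 = nn.Sequential(nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.GELU())
        self.s2 = nn.Sequential(nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.GELU())
        self.s3 = nn.Sequential(nn.Conv2d(ch * 2, ch * 4, 3, stride=2, padding=1), nn.GELU())

    def forward(self, x):
        f1 = self.s1(x)
        f2 = self.s2(f1)
        f3 = self.s3(f2)
        return [f1, f2, f3]

class PyramidFusion(nn.Module):
    """FPN-style top-down fusion of multi-scale features into one conditioning map."""
    def __init__(self, chs=(32, 64, 128), out_ch=32):
        super().__init__()
        self.lat = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in chs])

    def forward(self, feats):
        fused = self.lat[-1](feats[-1])
        for f, lat in zip(reversed(feats[:-1]), reversed(self.lat[:-1])):
            fused = lat(f) + F.interpolate(
                fused, size=f.shape[-2:], mode="bilinear", align_corners=False)
        return fused  # highest-resolution fused map

class ConditionalDenoiser(nn.Module):
    """Predicts the noise in a noisy attention map, given fused scene features."""
    def __init__(self, cond_ch=32, hidden=64, T=1000):
        super().__init__()
        self.t_embed = nn.Embedding(T, hidden)
        self.net = nn.Sequential(
            nn.Conv2d(1 + cond_ch + hidden, hidden, 3, padding=1), nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.GELU(),
            nn.Conv2d(hidden, 1, 3, padding=1),
        )

    def forward(self, y_t, cond, t):
        temb = self.t_embed(t)[:, :, None, None].expand(-1, -1, *y_t.shape[-2:])
        return self.net(torch.cat([y_t, cond, temb], dim=1))

# One DDPM-style training step with the standard epsilon-prediction objective.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

enc, fuse, denoiser = ToyEncoder(), PyramidFusion(), ConditionalDenoiser()

img = torch.randn(2, 3, 64, 64)   # driving-scene frame (toy resolution)
y0 = torch.rand(2, 1, 32, 32)     # ground-truth attention map at fused-feature resolution
cond = fuse(enc(img))             # multi-scale conditioning signal

t = torch.randint(0, T, (2,))
eps = torch.randn_like(y0)
ab = alpha_bar[t].view(-1, 1, 1, 1)
y_t = ab.sqrt() * y0 + (1 - ab).sqrt() * eps   # forward noising q(y_t | y_0)

loss = F.mse_loss(denoiser(y_t, cond, t), eps) # learn to reverse the noising
loss.backward()
```

The paper's dense, multi-scale conditional diffusion presumably injects conditioning at several decoder resolutions rather than the single fused map used here, and the LLM layer would contribute an additional top-down semantic signal; both are omitted from this sketch for brevity.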