AI Summary
Existing autoregressive vision-language models (VLMs) suffer from the inherent limitations of causal attention and sequential generation, which hinder consistent alignment between high-level reasoning and low-level planning and lead to poor out-of-distribution (OOD) generalization in end-to-end autonomous driving systems. To address this, we propose the first discrete diffusion-based VLM for autonomous driving, featuring bidirectional attention and iterative denoising that enable controllable, joint visual-linguistic modeling of perception, structured reasoning, and action planning. Structured prompts guide the generation of reasoning-action pairs, and the model is trained end-to-end. Evaluated on nuScenes and the Waymo Open Dataset End-to-End benchmark (WOD-E2E), our method improves behavior-trajectory consistency by 9%, raises the rater feedback score (RFS) on long-tail scenarios by 6%, and achieves planning performance competitive with state-of-the-art vision-language-action (VLA) systems. Key contributions include: (i) the first application of diffusion-based VLMs to autonomous driving, and (ii) a novel framework for enforcing reasoning-planning consistency.
Abstract
The autonomous driving community is increasingly focused on addressing the challenges posed by out-of-distribution (OOD) driving scenarios. A dominant research trend seeks to enhance end-to-end (E2E) driving systems by integrating vision-language models (VLMs), leveraging their rich world knowledge and reasoning abilities to improve generalization across diverse environments. However, most existing VLMs and vision-language-action models (VLAs) are built on autoregressive (AR) backbones. In this paper, we observe that AR-based VLMs, limited by causal attention and sequential token generation, often fail to maintain consistency and controllability between high-level reasoning and low-level planning. In contrast, recent discrete diffusion VLMs, equipped with bidirectional attention, achieve superior controllability and reliability through iterative denoising. Building on these observations, we introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving. Evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems despite a modest backbone, outperforming AR-based baselines with a 9% improvement in behavior-trajectory consistency and a 6% increase in RFS on long-tail WOD-E2E scenarios. These results point to a controllable and reliable pathway toward scalable end-to-end driving.
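To make the decoding style the abstract contrasts with AR generation more concrete, here is a minimal toy sketch of discrete-diffusion-style iterative denoising: all positions start masked, a bidirectional scorer evaluates every position at once, and the most confident positions are committed each step. This is purely illustrative, not the paper's implementation; `toy_denoiser`, `MASK`, and the tiny integer vocabulary are hypothetical stand-ins for a real VLM and its tokenizer.

```python
import random

MASK = -1
VOCAB = list(range(10))  # toy vocabulary (a real model would use a VLM tokenizer)

def toy_denoiser(tokens):
    """Hypothetical stand-in for a bidirectional transformer: unlike causal
    AR decoding, it scores every position in parallel, conditioning on the
    full (partially unmasked) sequence. Returns (prediction, confidence)."""
    rng = random.Random(sum(t for t in tokens if t != MASK))
    return [(rng.choice(VOCAB), rng.random()) for _ in tokens]

def diffusion_decode(length=8, steps=4):
    """Iterative denoising: repeatedly re-score and commit the most
    confident masked positions until no masks remain."""
    tokens = [MASK] * length
    per_step = max(1, length // steps)
    while MASK in tokens:
        preds = toy_denoiser(tokens)
        # rank masked positions by model confidence; commit the top ones
        masked = sorted((i for i, t in enumerate(tokens) if t == MASK),
                        key=lambda i: -preds[i][1])
        for i in masked[:per_step]:
            tokens[i] = preds[i][0]
    return tokens

print(diffusion_decode())
```

Because every denoising step conditions on the whole sequence in both directions, later tokens (e.g. a trajectory) can influence earlier ones (e.g. a stated behavior), which is the mechanism the paper credits for better reasoning-action consistency.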