dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning

📅 2025-12-04
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing autoregressive vision-language models (VLMs) suffer from the inherent limitations of causal attention and sequential token generation, which hinder consistent alignment between high-level reasoning and low-level planning and lead to poor out-of-distribution (OOD) generalization in end-to-end autonomous driving systems. To address this, we propose the first discrete diffusion-based VLM for autonomous driving, using bidirectional attention and iterative denoising to enable controllable, joint visual-linguistic modeling of perception, structured reasoning, and action planning. Structured prompts guide the generation of reasoning-action pairs, and the model is trained end-to-end. Evaluated on nuScenes and the Waymo Open Dataset End-to-End benchmark (WOD-E2E), our method improves behavior-trajectory consistency by 9%, raises RFS on long-tail scenes by 6%, and achieves planning performance competitive with state-of-the-art vision-language-action (VLA) systems. Key contributions: (i) the first application of diffusion-based VLMs to autonomous driving, and (ii) a framework for enforcing reasoning-planning consistency.
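The summary's core mechanism, iterative denoising with bidirectional attention, can be illustrated with a toy mask-predict style decoding loop. This is a hedged sketch, not the paper's implementation: the vocabulary, the `toy_denoiser` stand-in, and the linear unmasking schedule are all illustrative assumptions. Unlike autoregressive decoding, every position conditions on the full (bidirectional) context each step, and only the most confident predictions are committed:

```python
# Illustrative sketch (not from the paper): mask-predict style discrete
# diffusion decoding. All positions are predicted jointly each iteration;
# low-confidence positions are re-masked and refined in later steps.
import random

MASK = "<mask>"
VOCAB = ["slow", "down", "turn", "left", "stop", "go"]  # toy action vocabulary


def toy_denoiser(tokens):
    """Stand-in for the model: returns a (token, confidence) pair for each
    position, using the full bidirectional context (here, just toy randomness)."""
    preds = []
    for i, t in enumerate(tokens):
        if t == MASK:
            rng = random.Random(i * 7919 + len(tokens))  # deterministic toy output
            preds.append((rng.choice(VOCAB), rng.random()))
        else:
            preds.append((t, 1.0))  # already-committed tokens stay fixed
    return preds


def diffusion_decode(length=6, steps=3):
    tokens = [MASK] * length
    for step in range(steps):
        preds = toy_denoiser(tokens)
        # Keep a growing number of the most confident predictions each step
        # (a simple linear schedule); re-mask the rest for refinement.
        n_keep = (step + 1) * length // steps
        ranked = sorted(range(length), key=lambda i: -preds[i][1])
        keep = set(ranked[:n_keep])
        tokens = [preds[i][0] if i in keep else MASK for i in range(length)]
    return tokens
```

Because committed tokens are never contradicted by later steps, this style of decoding is what gives the model its controllability: structured prompt slots (e.g. a fixed behavior token) can be clamped and the trajectory denoised around them.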

📝 Abstract
The autonomous driving community is increasingly focused on addressing the challenges posed by out-of-distribution (OOD) driving scenarios. A dominant research trend seeks to enhance end-to-end (E2E) driving systems by integrating vision-language models (VLMs), leveraging their rich world knowledge and reasoning abilities to improve generalization across diverse environments. However, most existing VLMs or vision-language agents (VLAs) are built upon autoregressive (AR) models. In this paper, we observe that existing AR-based VLMs -- limited by causal attention and sequential token generation -- often fail to maintain consistency and controllability between high-level reasoning and low-level planning. In contrast, recent discrete diffusion VLMs equipped with bidirectional attention exhibit superior controllability and reliability through iterative denoising. Building on these observations, we introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving. Evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems despite a modest backbone, outperforming AR-based baselines with a 9 percent improvement in behavior-trajectory consistency and a 6 percent increase in RFS on long-tail WOD-E2E scenarios. These results suggest a controllable and reliable pathway for scalable end-to-end driving.
Problem

Research questions and friction points this paper is trying to address.

Enhance autonomous driving generalization in out-of-distribution scenarios
Improve consistency between high-level reasoning and low-level planning
Unify perception, reasoning, and planning via diffusion-based vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion VLM unifies perception, reasoning, planning for driving
Bidirectional attention enables controllable, reliable iterative denoising
Improves consistency and performance in out-of-distribution scenarios