ViLaD: A Large Vision Language Diffusion Framework for End-to-End Autonomous Driving

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing end-to-end autonomous driving vision-language models (VLMs) predominantly adopt autoregressive architectures, suffering from high inference latency and an inability to model bidirectional context—limitations that hinder real-time responsiveness and safety-critical decision-making in dynamic driving scenarios. To address these challenges, we propose ViLaD: the first framework integrating masked diffusion modeling into autonomous driving decision-making. ViLaD enables parallel generation of driving action sequences, achieving low-latency inference, bidirectional contextual modeling, and progressive optimization (from easy to hard). By unifying large-scale vision-language understanding with diffusion-based multi-step prediction, ViLaD significantly outperforms autoregressive VLM baselines on nuScenes—delivering higher planning accuracy and markedly faster inference. Real-world vehicle validation demonstrates near-zero failure rates in interactive parking tasks.

📝 Abstract
End-to-end autonomous driving systems built on Vision Language Models (VLMs) have shown significant promise, yet their reliance on autoregressive architectures introduces fundamental limitations for real-world applications. The sequential, token-by-token generation process of these models results in high inference latency and precludes bidirectional reasoning, making them ill-suited to dynamic, safety-critical environments. To overcome these challenges, we introduce ViLaD, a novel Large Vision Language Diffusion (LVLD) framework for end-to-end autonomous driving that represents a paradigm shift. ViLaD leverages a masked diffusion model that enables parallel generation of entire driving decision sequences, significantly reducing computational latency. Moreover, its architecture supports bidirectional reasoning, allowing the model to consider both past and future context simultaneously, and supports progressive easy-first generation that iteratively improves decision quality. We conduct comprehensive experiments on the nuScenes dataset, where ViLaD outperforms state-of-the-art autoregressive VLM baselines in both planning accuracy and inference speed, while achieving a near-zero failure rate. Furthermore, we demonstrate the framework's practical viability through a real-world deployment on an autonomous vehicle for an interactive parking task, confirming its effectiveness and soundness for practical applications.
Problem

Research questions and friction points this paper is trying to address.

High inference latency in autoregressive VLMs for autonomous driving
Lack of bidirectional reasoning in dynamic driving environments
Sequential token generation limits real-time decision-making in safety-critical systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel generation via masked diffusion model
Bidirectional reasoning for past and future
Progressive easy-first decision quality improvement
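The parallel, easy-first generation listed above can be sketched as iterative confidence-based unmasking: every masked position is predicted in parallel each step, and only the most confident ("easy") predictions are committed, leaving harder positions for later refinement. This is a minimal illustrative sketch only; `toy_model`, the action vocabulary size, and the unmasking schedule are stand-in assumptions, not the paper's actual denoiser, which conditions on camera and language inputs.

```python
import numpy as np

MASK = -1  # sentinel id for a masked (still-undecided) action token

def toy_model(tokens, vocab=8, seed=0):
    """Stand-in denoiser: returns a probability distribution over the
    action vocabulary at every sequence position. Hypothetical; the
    real ViLaD model conditions on vision-language inputs."""
    rng = np.random.default_rng(seed + int((tokens != MASK).sum()))
    logits = rng.normal(size=(len(tokens), vocab))
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def easy_first_decode(seq_len=8, steps=4, vocab=8):
    """Iterative parallel decoding: predict all masked positions at
    once, then commit only the highest-confidence ones each step."""
    tokens = np.full(seq_len, MASK)
    per_step = int(np.ceil(seq_len / steps))
    for _ in range(steps):
        masked = np.where(tokens == MASK)[0]
        if masked.size == 0:
            break
        probs = toy_model(tokens, vocab)               # parallel prediction
        conf = probs[masked].max(axis=-1)              # per-slot confidence
        commit = masked[np.argsort(-conf)[:per_step]]  # easiest slots first
        tokens[commit] = probs[commit].argmax(axis=-1)
    return tokens

out = easy_first_decode()
```

Note the contrast with autoregressive decoding: the loop runs a fixed, small number of steps regardless of sequence length, and each committed token can attend to context on both sides of it.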
Can Cui
College of Engineering, Purdue University, West Lafayette, IN 47907, USA

Yupeng Zhou
College of Engineering, Purdue University, West Lafayette, IN 47907, USA

Juntong Peng
Purdue University
Autonomous Driving, Collaborative Perception, Foundation Model, Intelligent Transportation

Sung-Yeon Park
College of Engineering, Purdue University, West Lafayette, IN 47907, USA

Zichong Yang
Purdue University
Engineering Systems Design, Autonomous Driving, Foundation Model

Prashanth Sankaranarayanan
College of Engineering, Purdue University, West Lafayette, IN 47907, USA

Jiaru Zhang
Purdue University

Ruqi Zhang
Assistant Professor of Computer Science, Purdue University
Machine Learning, Artificial Intelligence, Deep Learning, Statistics

Ziran Wang
Purdue University
Autonomous Driving, Digital Twin, Human-Centered AI, Human-Autonomy Teaming, Intelligent Vehicles