🤖 AI Summary
Existing end-to-end autonomous driving vision-language models (VLMs) predominantly adopt autoregressive architectures, suffering from high inference latency and an inability to model bidirectional context—limitations that hinder real-time responsiveness and safety-critical decision-making in dynamic driving scenarios. To address these challenges, we propose ViLaD: the first framework integrating masked diffusion modeling into autonomous driving decision-making. ViLaD enables parallel generation of driving action sequences, achieving low-latency inference, bidirectional contextual modeling, and progressive easy-first generation. By unifying large-scale vision-language understanding with diffusion-based multi-step prediction, ViLaD significantly outperforms autoregressive VLM baselines on nuScenes—delivering higher planning accuracy, markedly faster inference, and a near-zero failure rate. Real-world vehicle deployment on an interactive parking task further confirms its practical viability.
📝 Abstract
End-to-end autonomous driving systems built on Vision Language Models (VLMs) have shown significant promise, yet their reliance on autoregressive architectures introduces fundamental limitations for real-world applications. The sequential, token-by-token generation process of these models incurs high inference latency and precludes bidirectional reasoning, making them unsuitable for dynamic, safety-critical environments. To overcome these challenges, we introduce ViLaD, a novel Large Vision Language Diffusion (LVLD) framework for end-to-end autonomous driving that represents a paradigm shift. ViLaD leverages a masked diffusion model that enables parallel generation of entire driving decision sequences, significantly reducing computational latency. Moreover, its architecture supports bidirectional reasoning, allowing the model to condition on both past and future decisions simultaneously, and progressive easy-first generation, which iteratively improves decision quality. We conduct comprehensive experiments on the nuScenes dataset, where ViLaD outperforms state-of-the-art autoregressive VLM baselines in both planning accuracy and inference speed, while achieving a near-zero failure rate. Furthermore, we demonstrate the framework's practical viability through a real-world deployment on an autonomous vehicle for an interactive parking task, confirming its effectiveness in practice.
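To make the decoding scheme described above concrete, here is a minimal, hypothetical sketch of masked-diffusion parallel decoding with easy-first unmasking. All names (`toy_model`, `masked_diffusion_decode`, the mask sentinel, sequence and vocabulary sizes) are illustrative assumptions, not ViLaD's actual implementation; the stub model stands in for the vision-language denoiser.

```python
import numpy as np

MASK = -1  # illustrative sentinel for a masked (undecided) action token


def toy_model(tokens, vocab_size, rng):
    """Stand-in for the diffusion denoiser: returns logits for every
    position of the partially unmasked sequence. A real model would
    attend bidirectionally over vision-language context; here we use
    random logits purely to exercise the decoding loop."""
    return rng.standard_normal((len(tokens), vocab_size))


def masked_diffusion_decode(seq_len=8, vocab_size=16, steps=4, seed=0):
    """Easy-first parallel decoding: all positions start masked; each
    step predicts every masked position at once but commits only the
    most confident slots, so easy decisions are fixed before hard ones."""
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, MASK)
    per_step = int(np.ceil(seq_len / steps))  # slots committed per step
    for _ in range(steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        logits = toy_model(tokens, vocab_size, rng)
        probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
        conf = probs[masked].max(-1)                  # confidence per masked slot
        pick = masked[np.argsort(-conf)[:per_step]]   # easiest (most confident) first
        tokens[pick] = probs[pick].argmax(-1)
    return tokens
```

Unlike autoregressive decoding, which needs one forward pass per token, this loop needs only `steps` passes regardless of sequence length, which is the source of the latency reduction the abstract claims.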