🤖 AI Summary
Existing end-to-end autonomous driving vision-language models (VLMs) predominantly adopt autoregressive architectures, suffering from high inference latency and an inability to model bidirectional context—limitations that hinder real-time responsiveness and safety-critical decision-making in dynamic driving scenarios. To address these challenges, we propose ViLaD: the first framework integrating masked diffusion modeling into autonomous driving decision-making. ViLaD enables parallel generation of driving action sequences, achieving low-latency inference, bidirectional contextual modeling, and progressive easy-first generation. By unifying large-scale vision-language understanding with diffusion-based multi-step prediction, ViLaD significantly outperforms autoregressive VLM baselines on nuScenes—delivering higher planning accuracy, markedly faster inference, and a near-zero failure rate. Real-world vehicle deployment on an interactive parking task further confirms its practical viability.
📝 Abstract
End-to-end autonomous driving systems built on Vision Language Models (VLMs) have shown significant promise, yet their reliance on autoregressive architectures introduces fundamental limitations for real-world applications. The sequential, token-by-token generation process of these models incurs high inference latency and precludes bidirectional reasoning, making them unsuitable for dynamic, safety-critical environments. To overcome these challenges, we introduce ViLaD, a novel Large Vision Language Diffusion (LVLD) framework for end-to-end autonomous driving that represents a paradigm shift. ViLaD leverages a masked diffusion model that enables parallel generation of entire driving decision sequences, significantly reducing computational latency. Moreover, its architecture supports bidirectional reasoning, allowing the model to condition on both past and future decisions simultaneously, and progressive easy-first generation, which iteratively improves decision quality. We conduct comprehensive experiments on the nuScenes dataset, where ViLaD outperforms state-of-the-art autoregressive VLM baselines in both planning accuracy and inference speed, while achieving a near-zero failure rate. Furthermore, we demonstrate the framework's practical viability through a real-world deployment on an autonomous vehicle for an interactive parking task, confirming its effectiveness in practice.
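To make the decoding scheme described above concrete, here is a minimal, hypothetical sketch of masked-diffusion parallel decoding with easy-first unmasking. All names (`toy_model`, `masked_diffusion_decode`, the mask sentinel, sequence and vocabulary sizes) are illustrative assumptions, not ViLaD's actual implementation; the stub model stands in for the vision-language denoiser.

```python
import numpy as np

MASK = -1  # illustrative sentinel for a masked (undecided) action token


def toy_model(tokens, vocab_size, rng):
    """Stand-in for the diffusion denoiser: returns logits for every
    position of the partially unmasked sequence. A real model would
    attend bidirectionally over vision-language context; here we use
    random logits purely to exercise the decoding loop."""
    return rng.standard_normal((len(tokens), vocab_size))


def masked_diffusion_decode(seq_len=8, vocab_size=16, steps=4, seed=0):
    """Easy-first parallel decoding: all positions start masked; each
    step predicts every masked position at once but commits only the
    most confident slots, so easy decisions are fixed before hard ones."""
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, MASK)
    per_step = int(np.ceil(seq_len / steps))  # slots committed per step
    for _ in range(steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        logits = toy_model(tokens, vocab_size, rng)
        probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
        conf = probs[masked].max(-1)                  # confidence per masked slot
        pick = masked[np.argsort(-conf)[:per_step]]   # easiest (most confident) first
        tokens[pick] = probs[pick].argmax(-1)
    return tokens
```

Unlike autoregressive decoding, which needs one forward pass per token, this loop needs only `steps` passes regardless of sequence length, which is the source of the latency reduction the abstract claims.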