Efficient and Explainable End-to-End Autonomous Driving via Masked Vision-Language-Action Diffusion

📅 2026-02-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets three weaknesses of end-to-end autonomous driving: high inference latency, insufficient action precision, and poor decision interpretability. It introduces MVLAD-AD, a framework that builds the driving policy on a masked vision-language-action diffusion model. To improve accuracy and efficiency while keeping trajectories physically plausible, the approach combines discrete action tokenization, geometry-aware embedding learning, and an action-priority decoding strategy. According to the reported experiments on nuScenes and derived benchmarks, MVLAD-AD outperforms existing autoregressive and diffusion-based baselines in planning precision and inference efficiency while producing explainable reasoning.

📝 Abstract

Large Language Models (LLMs) and Vision-Language Models (VLMs) have emerged as promising candidates for end-to-end autonomous driving. However, these models typically face challenges in inference latency, action precision, and explainability. Existing autoregressive approaches struggle with slow token-by-token generation, while prior diffusion-based planners often rely on verbose, general-purpose language tokens that lack explicit geometric structure. In this work, we propose Masked Vision-Language-Action Diffusion for Autonomous Driving (MVLAD-AD), a novel framework designed to bridge the gap between efficient planning and semantic explainability via a masked vision-language-action diffusion model. Unlike methods that force actions into the language space, we introduce a discrete action tokenization strategy that constructs a compact codebook of kinematically feasible waypoints from real-world driving distributions. Moreover, we propose geometry-aware embedding learning to ensure that embeddings in the latent space approximate physical geometric metrics. Finally, an action-priority decoding strategy is introduced to prioritize trajectory generation. Extensive experiments on nuScenes and derived benchmarks demonstrate that MVLAD-AD achieves superior efficiency and outperforms state-of-the-art autoregressive and diffusion baselines in planning precision, while providing high-fidelity and explainable reasoning.
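The discrete action tokenization idea in the abstract can be illustrated with a small sketch. The abstract does not say how the codebook is built, so this example assumes plain k-means over observed (x, y) waypoints as a stand-in; the names `build_codebook`, `tokenize`, and `detokenize` are hypothetical, and k-means centroids are not guaranteed to be kinematically feasible in the way the paper's codebook is described.

```python
import math
import random

Waypoint = tuple[float, float]


def nearest(codebook: list[Waypoint], point: Waypoint) -> int:
    """Index of the codebook entry closest to `point` (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(codebook[i], point))


def build_codebook(waypoints: list[Waypoint], k: int,
                   iters: int = 20, seed: int = 0) -> list[Waypoint]:
    """Assumed construction: Lloyd's k-means over observed waypoints."""
    rng = random.Random(seed)
    centers = rng.sample(waypoints, k)  # initialise from real data points
    for _ in range(iters):
        # Assign every waypoint to its nearest center.
        buckets: list[list[Waypoint]] = [[] for _ in range(k)]
        for p in waypoints:
            buckets[nearest(centers, p)].append(p)
        # Move each center to the mean of its bucket; keep it if the bucket is empty.
        centers = [
            (sum(x for x, _ in b) / len(b), sum(y for _, y in b) / len(b)) if b else c
            for b, c in zip(buckets, centers)
        ]
    return centers


def tokenize(codebook: list[Waypoint], trajectory: list[Waypoint]) -> list[int]:
    """Map each continuous waypoint to the id of its nearest codebook entry."""
    return [nearest(codebook, p) for p in trajectory]


def detokenize(codebook: list[Waypoint], tokens: list[int]) -> list[Waypoint]:
    """Map action tokens back to the waypoints they stand for."""
    return [codebook[t] for t in tokens]
```

With such a codebook, a planned trajectory becomes a short sequence of integer ids rather than a string of general-purpose language tokens, which is the compactness argument the abstract makes.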

Problem

Research questions and friction points this paper is trying to address.

autonomous driving

inference latency

action precision

explainability

diffusion models

Innovation

Methods, ideas, or system contributions that make the work stand out.

masked diffusion

discrete action tokenization

geometry-aware embedding
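
The abstract describes geometry-aware embedding learning as making embeddings in the latent space approximate physical geometric metrics. One possible reading, sketched here as an assumption rather than the paper's actual objective, is a loss that penalises the gap between pairwise distances of action-token embeddings and pairwise distances of the waypoints those tokens represent. The function name `geometry_alignment_loss` and the `scale` parameter are hypothetical.

```python
import math
from itertools import combinations


def geometry_alignment_loss(embeddings, waypoints, scale=1.0):
    """Mean squared gap between embedding distances and physical distances.

    embeddings[i] is the latent vector of action token i and waypoints[i] is
    the (x, y) point that token stands for. `scale` converts metres into
    embedding-space units (an assumed hyperparameter).
    """
    pairs = list(combinations(range(len(waypoints)), 2))
    if not pairs:
        return 0.0
    total = 0.0
    for i, j in pairs:
        d_embed = math.dist(embeddings[i], embeddings[j])
        d_phys = scale * math.dist(waypoints[i], waypoints[j])
        total += (d_embed - d_phys) ** 2
    return total / len(pairs)
```

The loss is zero when the embedding layout preserves all pairwise waypoint distances and grows as tokens for nearby waypoints drift apart in the latent space, or tokens for distant waypoints collapse together.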