Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving

📅 2025-09-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing end-to-end vision-language-action (VLA) models for autonomous driving rely heavily on imitation learning, which limits their ability to internalize physical constraints and often forces a fallback to rule-based post-processing or costly gradient-based optimization. This work introduces ReflectDrive, the first framework integrating discrete diffusion modeling with a safety-aware reflection mechanism. It constructs an action codebook by discretizing the 2D driving space, uses goal-conditioned planning to generate initial trajectories, and performs gradient-free local self-correction through inpainting-style masked regeneration. By leveraging the multimodal understanding capabilities of vision-language models, ReflectDrive supports efficient and scalable end-to-end trajectory optimization. Evaluated on the NAVSIM benchmark, it improves trajectory reliability in safety-critical scenarios, including unprotected left turns and emergency evasive maneuvers, suggesting practical efficacy and generalization potential.

📝 Abstract

End-to-End (E2E) solutions have emerged as a mainstream approach for autonomous driving systems, with Vision-Language-Action (VLA) models representing a new paradigm that leverages pre-trained multimodal knowledge from Vision-Language Models (VLMs) to interpret and interact with complex real-world environments. However, these methods remain constrained by the limitations of imitation learning, which struggles to inherently encode physical rules during training. Existing approaches often rely on complex rule-based post-refinement, employ reinforcement learning that remains largely limited to simulation, or utilize diffusion guidance that requires computationally expensive gradient calculations. To address these challenges, we introduce ReflectDrive, a novel learning-based framework that integrates a reflection mechanism for safe trajectory generation via discrete diffusion. We first discretize the two-dimensional driving space to construct an action codebook, enabling the use of pre-trained Diffusion Language Models for planning tasks through fine-tuning. Central to our approach is a safety-aware reflection mechanism that performs iterative self-correction without gradient computation. Our method begins with goal-conditioned trajectory generation to model multi-modal driving behaviors. Based on this, we apply local search methods to identify unsafe tokens and determine feasible solutions, which then serve as safe anchors for inpainting-based regeneration. Evaluated on the NAVSIM benchmark, ReflectDrive demonstrates significant advantages in safety-critical trajectory generation, offering a scalable and reliable solution for autonomous driving systems.
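The codebook step described above can be illustrated with a minimal Python sketch. It assumes one token per grid cell of a uniform bird's-eye-view grid; the grid bounds, the 0.5 m resolution, and the names `encode` and `decode` are hypothetical choices for illustration, not values or APIs taken from the paper.

```python
import math
from typing import Tuple

# Hypothetical grid (assumptions, not from the paper): metres in the ego frame.
X_MIN, X_MAX = -10.0, 50.0   # longitudinal range
Y_MIN, Y_MAX = -20.0, 20.0   # lateral range
RES = 0.5                    # cell size in metres

N_X = int(round((X_MAX - X_MIN) / RES))  # number of columns
N_Y = int(round((Y_MAX - Y_MIN) / RES))  # number of rows
VOCAB_SIZE = N_X * N_Y                   # size of the action codebook


def encode(x: float, y: float) -> int:
    """Map a continuous waypoint to the token id of the grid cell containing it."""
    ix = min(max(math.floor((x - X_MIN) / RES), 0), N_X - 1)  # clamp to grid
    iy = min(max(math.floor((y - Y_MIN) / RES), 0), N_Y - 1)
    return iy * N_X + ix


def decode(token: int) -> Tuple[float, float]:
    """Map a token id back to the centre of its grid cell."""
    iy, ix = divmod(token, N_X)
    return (X_MIN + (ix + 0.5) * RES, Y_MIN + (iy + 0.5) * RES)


# A short trajectory becomes a token sequence that a discrete model can consume.
waypoints = [(0.0, 0.0), (2.1, 0.1), (4.3, 0.4)]
tokens = [encode(x, y) for x, y in waypoints]
```

Under this scheme the round trip `decode(encode(x, y))` moves a waypoint by at most half a cell along each axis, which is the usual trade-off between vocabulary size and spatial precision.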

Problem

Research questions and friction points this paper is trying to address.

Overcoming imitation learning limitations in VLA models for autonomous driving

Eliminating computationally expensive gradient calculations for trajectory refinement

Addressing safety-critical trajectory generation without complex rule-based systems

Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrete diffusion for trajectory generation

Safety-aware reflection without gradient computation

Token-level local search to find and correct unsafe trajectory tokens
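The reflection idea listed above (find unsafe tokens, search locally for safe replacements, then regenerate around those anchors) can be sketched as a toy loop in Python. Everything here is an illustrative assumption: tokens are represented as integer grid cells, `is_safe` is a hypothetical safety checker, and `inpaint_stub` is a simple interpolation standing in for the masked regeneration that the diffusion model would perform. It is not the authors' implementation.

```python
from typing import Callable, List, Optional, Set, Tuple

Cell = Tuple[int, int]  # (ix, iy) grid index standing in for one action token


def nearest_safe_cell(cell: Cell, is_safe: Callable[[Cell], bool],
                      radius: int) -> Optional[Cell]:
    """Gradient-free local search: closest safe cell within a square window."""
    best, best_d = None, None
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            cand = (cell[0] + dx, cell[1] + dy)
            d = dx * dx + dy * dy
            if is_safe(cand) and (best_d is None or d < best_d):
                best, best_d = cand, d
    return best


def inpaint_stub(traj: List[Cell], masked: Set[int]) -> List[Cell]:
    """Stand-in for model-based inpainting: fill each masked position by
    interpolating between its nearest unmasked neighbours."""
    out = list(traj)
    for i in sorted(masked):
        lo = max(j for j in range(i) if j not in masked)
        hi = min(j for j in range(i + 1, len(traj)) if j not in masked)
        t = (i - lo) / (hi - lo)
        out[i] = (round(traj[lo][0] + t * (traj[hi][0] - traj[lo][0])),
                  round(traj[lo][1] + t * (traj[hi][1] - traj[lo][1])))
    return out


def reflect(traj: List[Cell], is_safe: Callable[[Cell], bool],
            inpaint: Callable[[List[Cell], Set[int]], List[Cell]] = inpaint_stub,
            radius: int = 3, max_iters: int = 5) -> List[Cell]:
    """Iteratively replace unsafe tokens with safe anchors and regenerate
    the tokens next to them, without computing any gradients."""
    traj = list(traj)
    for _ in range(max_iters):
        unsafe = [i for i, c in enumerate(traj) if not is_safe(c)]
        if not unsafe:
            break
        anchors = set()
        for i in unsafe:
            fix = nearest_safe_cell(traj[i], is_safe, radius)
            if fix is not None:
                traj[i] = fix          # safe anchor, kept fixed during inpainting
                anchors.add(i)
        if not anchors:
            break                      # no feasible fix found in the window
        # Mask interior neighbours of the anchors; endpoints stay fixed.
        masked = {j for i in anchors for j in (i - 1, i + 1)
                  if 0 < j < len(traj) - 1 and j not in anchors}
        traj = inpaint(traj, masked)
    return traj


# Toy scene: a straight path along y = 0 blocked by an obstacle on cells x = 3..6.
def demo_is_safe(c: Cell) -> bool:
    return not (3 <= c[0] <= 6 and c[1] == 0)

initial = [(x, 0) for x in range(10)]
fixed = reflect(initial, demo_is_safe)
```

In this toy scene the blocked waypoints are shifted off the obstacle while the start and end of the path stay in place. A real system would replace `inpaint_stub` with masked regeneration by the fine-tuned model, so that the regenerated tokens also reflect the scene context.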
