🤖 AI Summary
To address the inefficiency and inflexibility of end-to-end autonomous driving trajectory generation that relies on autoregressive large language models or continuous diffusion policies, this paper proposes WAM-Diff, the first trajectory generation method to incorporate discrete masked diffusion into a vision-language-action (VLA) framework. Its core contributions are: (1) a driving-oriented masked diffusion trajectory modeling paradigm enabling non-causal, iterative sequence refinement; (2) a sparse Mixture-of-Experts (MoE) architecture jointly trained for motion prediction and driving-oriented visual question answering; and (3) online reinforcement learning with Group Sequence Policy Optimization (GSPO), which optimizes sequence-level driving rewards. Evaluated on NAVSIM-v1 and NAVSIM-v2, WAM-Diff achieves 91.0 PDMS and 89.7 EPDMS, respectively, substantially outperforming state-of-the-art autoregressive and continuous diffusion baselines. These results validate discrete masked diffusion as an effective paradigm for autonomous driving trajectory generation.
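The GSPO objective mentioned above can be illustrated with a short sketch. This is a generic GSPO-style clipped loss on sequence-level, group-normalized rewards, not the paper's actual training code; the function name, array shapes, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def gspo_loss(logp_new, logp_old, rewards, eps=0.2):
    """Sketch of a GSPO-style sequence-level clipped objective.

    logp_new, logp_old: (G, T) per-token log-probs of G sampled
    trajectory sequences under the current and behavior policies.
    rewards: (G,) sequence-level driving rewards for the group.
    All names/shapes are illustrative, not from the WAM-Diff paper.
    """
    T = logp_new.shape[1]
    # Length-normalized sequence-level importance ratio
    # (geometric mean of per-token ratios).
    s = np.exp((logp_new.sum(axis=1) - logp_old.sum(axis=1)) / T)
    # Group-normalized advantage: reward standardized within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # PPO-style clipping applied at the sequence level.
    clipped = np.clip(s, 1.0 - eps, 1.0 + eps)
    return -np.mean(np.minimum(s * adv, clipped * adv))
```

The key difference from token-level policy-gradient methods is that the importance ratio and clipping operate on whole sequences, matching the sequence-level nature of driving rewards such as PDMS.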
📝 Abstract
End-to-end autonomous driving systems based on vision-language-action (VLA) models integrate multimodal sensor inputs and language instructions to generate planning and control signals. While autoregressive large language models and continuous diffusion policies are prevalent, the potential of discrete masked diffusion for trajectory generation remains largely unexplored. This paper presents WAM-Diff, a VLA framework that employs masked diffusion to iteratively refine a discrete sequence representing future ego trajectories. Our approach features three key innovations: a systematic adaptation of masked diffusion for autonomous driving that supports flexible, non-causal decoding orders; scalable model capacity via a sparse MoE architecture trained jointly on motion prediction and driving-oriented visual question answering (VQA); and online reinforcement learning using Group Sequence Policy Optimization (GSPO) to optimize sequence-level driving rewards. Remarkably, our model achieves 91.0 PDMS on NAVSIM-v1 and 89.7 EPDMS on NAVSIM-v2, demonstrating the effectiveness of masked diffusion for autonomous driving. The approach provides a promising alternative to autoregressive and diffusion-based policies, supporting scenario-aware decoding strategies for trajectory generation. The code for this paper will be released publicly at: https://github.com/fudan-generative-vision/WAM-Diff
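The non-causal, iterative refinement the abstract describes can be sketched as confidence-based unmasking of a discrete token sequence. This is a minimal toy illustration of discrete masked diffusion decoding in general, not WAM-Diff's actual model or tokenization; the vocabulary size, sequence length, stand-in model, and unmasking schedule are all assumptions for the sketch.

```python
import numpy as np

MASK = -1      # sentinel for a still-masked position (illustrative)
SEQ_LEN = 8    # number of future trajectory tokens (illustrative)
VOCAB = 16     # size of the discrete trajectory vocabulary (illustrative)

rng = np.random.default_rng(0)

def toy_model(tokens):
    """Stand-in for the VLA backbone: per-position logits over VOCAB.

    A real model would condition on camera features, language, and the
    partially unmasked sequence; here we return random logits.
    """
    return rng.normal(size=(len(tokens), VOCAB))

def masked_diffusion_decode(steps=4):
    """Start fully masked; each step unmask the most confident positions."""
    tokens = np.full(SEQ_LEN, MASK)
    for step in range(steps):
        logits = toy_model(tokens)
        probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
        conf = probs.max(axis=-1)      # per-position confidence
        pred = probs.argmax(axis=-1)   # per-position best token
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        # Reveal a fraction of the remaining masked positions, choosing
        # the most confident ones: a non-causal decoding order, unlike
        # left-to-right autoregressive generation.
        k = max(1, int(np.ceil(masked.size / (steps - step))))
        chosen = masked[np.argsort(-conf[masked])[:k]]
        tokens[chosen] = pred[chosen]
    return tokens
```

Because any position can be committed at any step, the decoding order can adapt to the scene, which is the flexibility the abstract refers to as scenario-aware decoding.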