WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving

📅 2025-12-06
🤖 AI Summary
To address the inefficiency and inflexibility in end-to-end autonomous driving trajectory generation—stemming from reliance on autoregressive large language models or continuous diffusion policies—this paper proposes WAM-Diff, the first trajectory generation method to incorporate discrete masked diffusion into a vision-language-action (VLA) framework. Its core contributions are: (1) a driving-oriented masked diffusion trajectory modeling paradigm enabling non-causal, iterative sequence refinement; (2) a sparse Mixture-of-Experts architecture jointly trained for motion prediction and driving-oriented visual question answering; and (3) Group Sequence Policy Optimization (GSPO), an online reinforcement learning algorithm optimizing sequence-level driving rewards. Evaluated on NAVSIM-v1 and NAVSIM-v2, WAM-Diff achieves 91.0 PDMS and 89.7 EPDMS, respectively—substantially outperforming state-of-the-art autoregressive and continuous diffusion baselines. These results validate the effectiveness and superiority of discrete masked diffusion for autonomous driving trajectory generation.
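The non-causal, iterative refinement the summary describes can be sketched as a toy decoding loop: start from a fully masked discrete trajectory and repeatedly commit the most confident positions, in any order. This is a minimal sketch of the general masked-diffusion decoding pattern; the names (`VOCAB_SIZE`, `SEQ_LEN`, `toy_logits`) and the confidence-based unmasking schedule are illustrative assumptions, not details from the paper.

```python
import numpy as np

VOCAB_SIZE = 128   # size of a hypothetical discretized-waypoint vocabulary
SEQ_LEN = 8        # number of trajectory tokens
MASK = -1          # sentinel id for a still-masked position

def toy_logits(tokens, rng):
    """Stand-in for the VLA model: random per-position logits."""
    return rng.normal(size=(len(tokens), VOCAB_SIZE))

def masked_diffusion_decode(steps=4, seed=0):
    rng = np.random.default_rng(seed)
    tokens = np.full(SEQ_LEN, MASK)              # start fully masked
    per_step = SEQ_LEN // steps                  # positions revealed per step
    for _ in range(steps):
        logits = toy_logits(tokens, rng)
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        conf = probs.max(axis=-1)
        conf[tokens != MASK] = -np.inf           # committed tokens stay fixed
        # non-causal: reveal the most confident masked positions anywhere
        reveal = np.argsort(conf)[-per_step:]
        tokens[reveal] = probs[reveal].argmax(axis=-1)
    return tokens

traj = masked_diffusion_decode()
assert (traj != MASK).all()   # every position is unmasked after all steps
```

Unlike autoregressive decoding, the reveal order here is chosen by model confidence rather than left-to-right position, which is what enables scenario-aware decoding strategies.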

📝 Abstract
End-to-end autonomous driving systems based on vision-language-action (VLA) models integrate multimodal sensor inputs and language instructions to generate planning and control signals. While autoregressive large language models and continuous diffusion policies are prevalent, the potential of discrete masked diffusion for trajectory generation remains largely unexplored. This paper presents WAM-Diff, a VLA framework that employs masked diffusion to iteratively refine a discrete sequence representing future ego-trajectories. Our approach features three key innovations: a systematic adaptation of masked diffusion for autonomous driving that supports flexible, non-causal decoding orders; scalable model capacity via a sparse MoE architecture trained jointly on motion prediction and driving-oriented visual question answering (VQA); and online reinforcement learning using Group Sequence Policy Optimization (GSPO) to optimize sequence-level driving rewards. Remarkably, our model achieves 91.0 PDMS on NAVSIM-v1 and 89.7 EPDMS on NAVSIM-v2, demonstrating the effectiveness of masked diffusion for autonomous driving. The approach provides a promising alternative to autoregressive and diffusion-based policies, supporting scenario-aware decoding strategies for trajectory generation. The code for this paper will be released publicly at: https://github.com/fudan-generative-vision/WAM-Diff
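The abstract's sparse MoE component follows the standard pattern of routing each token to a small subset of expert networks. The sketch below shows generic top-k gating with softmax-weighted expert mixing; the paper does not specify its router, so `k=2` gating and the linear experts here are assumptions for illustration only.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route each token to its top-k experts, mixing outputs by softmax gate weight."""
    scores = x @ gate_w                            # (tokens, n_experts)
    topk = np.argsort(scores, axis=-1)[:, -k:]     # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = topk[t]
        w = np.exp(scores[t, sel])
        w /= w.sum()                               # renormalize over selected experts
        for weight, e in zip(w, sel):
            out[t] += weight * experts[e](x[t])    # only k experts run per token
    return out

# Toy usage: 4 tokens, 8-dim features, 4 linear experts
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
gate_w = rng.normal(size=(8, 4))
experts = [(lambda W: (lambda v: W @ v))(rng.normal(size=(8, 8))) for _ in range(4)]
out = moe_forward(x, gate_w, experts)
```

Because only k of the experts execute per token, total parameter count scales with the number of experts while per-token compute stays roughly constant, which is the capacity argument the abstract makes.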
Problem

Research questions and friction points this paper is trying to address.

Autoregressive LLM decoders generate trajectory tokens strictly left-to-right, making planning inefficient and inflexible
Continuous diffusion policies dominate the alternatives, while discrete masked diffusion remains largely unexplored for trajectory generation
Imitation-trained VLA planners lack a mechanism to scale capacity and directly optimize sequence-level driving rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Driving-oriented masked diffusion enabling non-causal, iterative refinement of discrete trajectory tokens
Sparse MoE architecture, trained jointly on motion prediction and driving VQA, for scalable model capacity
Online reinforcement learning with GSPO to optimize sequence-level driving rewards
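The third innovation, GSPO, differs from token-level policy-gradient methods by clipping a sequence-level importance ratio (the length-normalized geometric mean of per-token ratios) against a group-normalized reward. The sketch below follows the published GSPO formulation in simplified numpy form; array shapes and the `eps` value are illustrative, and how WAM-Diff computes its driving rewards is not shown here.

```python
import numpy as np

def group_advantages(rewards):
    """Group-relative advantage: standardize rewards within a sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def gspo_objective(logp_new, logp_old, advantages, eps=0.2):
    """
    Sequence-level clipped objective in the style of GSPO.
    logp_new, logp_old: (G, T) per-token log-probs for G sampled trajectories
    under the current and behavior policies; advantages: (G,) group-normalized.
    """
    # Sequence-level importance ratio: exp of the mean per-token log-ratio,
    # i.e. the length-normalized geometric mean of token ratios.
    seq_ratio = np.exp((logp_new - logp_old).mean(axis=-1))
    clipped = np.clip(seq_ratio, 1 - eps, 1 + eps)
    # PPO-style pessimistic objective, applied once per sequence.
    return np.minimum(seq_ratio * advantages, clipped * advantages).mean()
```

A sanity check of the construction: when the current and behavior policies coincide, every sequence ratio is 1 and the objective reduces to the mean group-normalized advantage, which is approximately zero by construction.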