🤖 AI Summary
Autoregressive decoding for autonomous-vehicle trajectory planning is sequential, which makes inference inefficient and limits scalability. Method: This paper proposes a parallel, coarse-to-fine paradigm that casts trajectory planning as discrete flow matching over a structured token space. It introduces a metric-aligned numerical tokenizer trained with triplet-margin learning, a geometry-aware flow objective, and a simulator-guided GRPO alignment mechanism. A multi-stage adaptation converts a pretrained autoregressive backbone (Janus-1.5B) from causal decoding to a non-causal flow model, with continued multimodal pretraining to strengthen road-scene competence. Results: On NAVSIM v1, the method achieves 89.1 PDMS with single-step inference and 90.3 PDMS with five-step inference, outperforming autoregressive and diffusion-based VLA baselines while keeping generation fully parallel.
📝 Abstract
We introduce WAM-Flow, a vision-language-action (VLA) model that casts ego-trajectory planning as discrete flow matching over a structured token space. In contrast to autoregressive decoders, WAM-Flow performs fully parallel, bidirectional denoising, enabling coarse-to-fine refinement with a tunable compute-accuracy trade-off. Specifically, the approach combines three components: a metric-aligned numerical tokenizer that preserves scalar geometry via triplet-margin learning; a geometry-aware flow objective; and a simulator-guided GRPO alignment that integrates safety, ego-progress, and comfort rewards while retaining parallel generation. A multi-stage adaptation converts a pretrained autoregressive backbone (Janus-1.5B) from causal decoding to a non-causal flow model and strengthens road-scene competence through continued multimodal pretraining. Owing to consistency-model training and parallel decoding at inference, WAM-Flow achieves superior closed-loop performance over autoregressive and diffusion-based VLA baselines, with 1-step inference attaining 89.1 PDMS and 5-step inference reaching 90.3 PDMS on the NAVSIM v1 benchmark. These results establish discrete flow matching as a promising new paradigm for end-to-end autonomous driving. The code will be publicly available soon.
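To make the idea of a metric-aligned tokenizer concrete, the following is a minimal sketch of a standard triplet-margin loss applied to numerical tokens. It is not the authors' implementation. The bin values, the toy 2-D embeddings, the margin, and the helper names `triplet_margin_loss` and `pick_triplet` are all assumptions made for illustration. The loss is zero when a token sits closer in embedding space to the token with the nearer scalar value, and positive otherwise.

```python
import math

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss: max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

def pick_triplet(values, i, j, k):
    """Return (anchor, positive, negative) indices, where the positive is
    whichever of bins j and k has the scalar value closer to bin i."""
    if abs(values[i] - values[j]) <= abs(values[i] - values[k]):
        return i, j, k
    return i, k, j

# Hypothetical scalar bins (e.g. metres) and toy 2-D token embeddings.
values = [0.0, 0.5, 1.0, 4.0]
aligned = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (4.0, 0.0)]    # geometry preserved
scrambled = [(0.0, 0.0), (4.0, 0.0), (1.0, 0.0), (0.5, 0.0)]  # geometry broken

# Anchor is bin 0 (value 0.0); candidates are bin 3 (4.0) and bin 1 (0.5).
a, p, n = pick_triplet(values, 0, 3, 1)

loss_aligned = triplet_margin_loss(aligned[a], aligned[p], aligned[n])
loss_scrambled = triplet_margin_loss(scrambled[a], scrambled[p], scrambled[n])

print(loss_aligned, loss_scrambled)
```

In this toy setting the aligned embedding incurs no penalty, while the scrambled one is penalized. A loss of this kind pushes token embeddings toward an ordering that respects the underlying scalar values.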