SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model

📅 2026-04-21
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the high inference latency and limited robustness of existing vision-language-action (VLA) models in autonomous driving, particularly in long-tail scenarios that demand complex reasoning. The authors propose SpanVLA, an end-to-end framework that couples an autoregressive vision-language model with a flow-matching action expert. The action expert generates future trajectories with a flow-matching policy initialized from the historical trajectory, which reduces inference time relative to autoregressive action generation. A GRPO-based post-training stage lets the model learn from positive driving samples, learn to avoid typical negative behaviors, and learn recovery behaviors. The authors also introduce mReasoning, a real-world driving reasoning dataset focused on reasoning-demanding scenarios and negative-recovery samples. Experiments on the NAVSIM v1 and v2 benchmarks show competitive planning performance.
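To make the idea of flow-matching sampling from a history-based initialization concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the constant-velocity initializer `init_from_history`, the Euler integrator `euler_sample`, and the closed-form toy velocity field `make_toy_velocity` are all hypothetical names chosen for illustration. In the actual model the velocity field would be a learned network conditioned on vision and reasoning features.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]
VelocityField = Callable[[Trajectory, float], Trajectory]


def init_from_history(history: Trajectory, horizon: int) -> Trajectory:
    """Assumed initializer: constant-velocity extrapolation of the last two poses."""
    (x0, y0), (x1, y1) = history[-2], history[-1]
    vx, vy = x1 - x0, y1 - y0
    return [(x1 + vx * (k + 1), y1 + vy * (k + 1)) for k in range(horizon)]


def euler_sample(x_init: Trajectory, velocity: VelocityField, num_steps: int) -> Trajectory:
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x_init)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so 1 - t never reaches zero
        v = velocity(x, t)
        x = [(px + dt * vx, py + dt * vy) for (px, py), (vx, vy) in zip(x, v)]
    return x


def make_toy_velocity(target: Trajectory) -> VelocityField:
    """Toy stand-in for a learned network: the velocity of a straight-line
    path from the current point to a fixed target, (target - x) / (1 - t)."""
    def velocity(x: Trajectory, t: float) -> Trajectory:
        return [((tx - px) / (1.0 - t), (ty - py) / (1.0 - t))
                for (px, py), (tx, ty) in zip(x, target)]
    return velocity


history = [(0.0, 0.0), (1.0, 0.0)]                # past ego positions (made up)
target = [(2.0, 0.5), (3.0, 1.0), (4.0, 1.5)]     # made-up plan for the toy field
x_init = init_from_history(history, horizon=3)
planned = euler_sample(x_init, make_toy_velocity(target), num_steps=5)
print(planned)
```

Starting from an extrapolated trajectory rather than pure noise means the integrator begins close to a plausible plan, which is one intuition for why few integration steps can suffice.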

📝 Abstract
Vision-Language-Action (VLA) models offer a promising paradigm for autonomous driving because they can leverage world knowledge and reasoning capabilities, especially in long-tail scenarios. However, existing VLA models often suffer from high latency when generating actions autoregressively, and they exhibit limited robustness. In this paper, we propose SpanVLA, a novel end-to-end autonomous driving framework that integrates autoregressive reasoning with a flow-matching action expert. First, SpanVLA introduces an efficient bridge that uses the vision and reasoning guidance of the VLM to plan future trajectories with a flow-matching policy conditioned on historical trajectory initialization, which significantly reduces inference time. Second, to further improve the performance and robustness of SpanVLA, we propose a GRPO-based post-training method that enables the VLA model to learn not only from positive driving samples but also how to avoid typical negative behaviors and how to recover from them. We further introduce mReasoning, a new real-world driving reasoning dataset that focuses on complex, reasoning-demanding scenarios and negative-recovery samples. Extensive experiments on the NAVSIM benchmarks (v1 and v2) demonstrate the competitive performance of SpanVLA. Additionally, qualitative results across diverse scenarios highlight the planning performance and robustness of our model.
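As a rough illustration of the group-relative step that gives GRPO its name, the sketch below normalizes the rewards of a group of sampled trajectories into advantages. The function name `group_relative_advantages` and the reward values are hypothetical; the abstract does not specify how SpanVLA scores positive, negative, and recovery samples, so this shows only the generic normalization, not the paper's reward design.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four trajectories sampled for the same scene:
# a good plan, an unsafe plan, a recovery plan, and another unsafe plan.
rewards = [1.0, 0.2, 0.6, 0.2]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Samples that score above the group mean receive positive advantages and are reinforced, while below-mean samples receive negative advantages and are discouraged, which is how a single group can carry both "do this" and "avoid this" signals.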
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
autonomous driving
action generation latency
robustness
negative-recovery samples
Innovation

Methods, ideas, or system contributions that make the work stand out.

flow-matching
negative-recovery learning
autoregressive reasoning
GRPO-based post-training
vision-language-action model