SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model

📅 2026-04-21

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work addresses the high inference latency and limited robustness of existing vision-language-action (VLA) models in autonomous driving, particularly in long-tail scenarios that demand complex reasoning. The authors propose SpanVLA, a framework that integrates an autoregressive vision-language model with a flow-matching action expert. SpanVLA generates future trajectories with a flow-matching strategy initialized from historical trajectories and introduces, for the first time, a joint learning mechanism that combines negative samples with recovery behaviors. Using GRPO post-training and a newly curated mReasoning dataset, SpanVLA significantly reduces inference latency while achieving superior planning performance and robustness on the NAVSIM v1 and v2 benchmarks.
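
To make the idea of flow-matching trajectory generation initialized from history more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The helper names (`extrapolate_history`, `integrate_flow`, `toy_velocity`), the constant-velocity initializer, and the closed-form velocity field are all assumptions made for illustration; in a real system the velocity would come from a learned action expert conditioned on the VLM features.

```python
from typing import Callable, List, Tuple

Waypoint = Tuple[float, float]
Flat = List[float]  # flattened trajectory: [x1, y1, x2, y2, ...]


def extrapolate_history(history: List[Waypoint], horizon: int) -> Flat:
    """Hypothetical initializer: constant-velocity extrapolation
    from the last two past waypoints."""
    (px, py), (cx, cy) = history[-2], history[-1]
    dx, dy = cx - px, cy - py
    flat: Flat = []
    for k in range(1, horizon + 1):
        flat += [cx + k * dx, cy + k * dy]
    return flat


def integrate_flow(x_init: Flat,
                   velocity: Callable[[Flat, float], Flat],
                   num_steps: int = 10) -> Flat:
    """Euler-integrate dx/dt = velocity(x, t) from t = 0 to t = 1."""
    x = list(x_init)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(target: Flat) -> Callable[[Flat, float], Flat]:
    """Stand-in for a learned velocity network (assumption): the
    straight-line field that carries any point to `target` at t = 1."""
    def v(x: Flat, t: float) -> Flat:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return v


# Usage: start from an extrapolated guess instead of pure noise.
history = [(0.0, 0.0), (1.0, 0.0)]
init = extrapolate_history(history, horizon=3)
target = [2.0, 0.5, 3.0, 1.0, 4.0, 1.5]  # toy "ground-truth" future
plan = integrate_flow(init, toy_velocity(target), num_steps=10)
```

Because the starting point is already close to a plausible future, a short integration schedule covers the remaining distance, which is one intuition for why a history-based initialization can cut inference time compared with token-by-token autoregressive decoding.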


📝 Abstract

Vision-Language-Action (VLA) models offer a promising paradigm for autonomous driving because they can leverage world knowledge and reasoning capabilities, especially in long-tail scenarios. However, existing VLA models often suffer from high latency when generating actions autoregressively, and they exhibit limited robustness. In this paper, we propose SpanVLA, a novel end-to-end autonomous driving framework that integrates an autoregressive reasoning component with a flow-matching action expert. First, SpanVLA introduces an efficient bridge that uses the VLM's vision and reasoning guidance to plan future trajectories with a flow-matching policy initialized from the historical trajectory, which significantly reduces inference time. Second, to further improve performance and robustness, we propose a GRPO-based post-training method that enables the VLA model to learn not only from positive driving samples but also how to avoid typical negative behaviors and how to perform recovery behaviors. We further introduce mReasoning, a new real-world driving reasoning dataset focused on complex, reasoning-demanding scenarios and negative-recovery samples. Extensive experiments on NAVSIM (v1 and v2) demonstrate the competitive performance of SpanVLA. Additionally, qualitative results across diverse scenarios highlight the planning performance and robustness of our model.
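
The GRPO-based post-training mentioned in the abstract relies on comparing several sampled outputs for the same input against each other. The sketch below shows only the standard group-relative advantage normalization, not the paper's reward design. The reward values assigned to positive, recovery, and negative rollouts are hypothetical and chosen purely for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward by the mean and standard
    deviation of its own sampling group (GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against sigma == 0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for three trajectories sampled for one scene:
# a clean drive, a recovery maneuver, and a negative (unsafe) behavior.
rewards = [1.0, 0.5, 0.0]
advantages = group_relative_advantages(rewards)
```

Under this normalization, the rollout scored above the group mean receives a positive advantage and is reinforced, while the negative rollout receives a negative advantage and is discouraged, which is one way a policy can learn from both kinds of samples in the same update.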

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

autonomous driving

action generation latency

robustness

negative-recovery samples

Innovation

Methods, ideas, or system contributions that make the work stand out.

flow-matching

negative-recovery learning

autoregressive reasoning