🤖 AI Summary
Existing vision-language-action (VLA) models for autonomous driving suffer from low inference efficiency and poor generalization across diverse vehicle configurations and driving scenarios. To address these limitations, this paper proposes an efficient and generalizable VLA framework featuring: (1) learnable action queries integrated with Gaussian-initialized trajectory sampling and chain-of-thought–enhanced multimodal feature modeling; and (2) a unified multi-dataset training paradigm that jointly leverages vision-language pretraining, supervised fine-tuning, and reinforcement learning. Extensive experiments demonstrate that our approach achieves state-of-the-art performance across multiple autonomous driving benchmarks—including nuPlan, Waymo Open Motion, and nuScenes—while substantially reducing inference latency. Moreover, it exhibits strong cross-domain generalization, enabling real-time, continuous trajectory generation under complex, dynamic traffic conditions.
📝 Abstract
Vision-Language-Action (VLA) models have recently shown strong decision-making capabilities in autonomous driving. However, existing VLAs often struggle to achieve efficient inference and to generalize to novel autonomous vehicle configurations and driving scenarios. In this paper, we propose Reasoning-VLA, a general and fast action-generation VLA framework. The proposed model employs a set of learnable action queries, initialized via Gaussian sampling from ground-truth trajectories within the training corpus. These learnable queries interact with reasoning-enhanced vision-language features to generate continuous action trajectories in parallel. To promote robust generalization, we consolidate eight publicly available autonomous driving datasets into a standardized, Chain-of-Thought reasoning-based, and easy-to-use data format for model training. Trained with both supervised learning and reinforcement-learning fine-tuning, Reasoning-VLA achieves state-of-the-art performance, superior generalization capability, and the fastest inference speed reported to date in extensive empirical evaluations across multiple benchmarks.
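The core decoding idea described above — Gaussian-initialized learnable action queries attending to reasoning-enhanced vision-language features, then emitting all continuous trajectories in one parallel pass rather than autoregressively — can be illustrated with a minimal NumPy sketch. All dimensions, the single-head attention, and the linear output head here are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative, not from the paper)
num_queries, d_model = 6, 32   # parallel learnable action queries
horizon, action_dim = 8, 2     # waypoints per trajectory, (x, y) per waypoint

# Gaussian initialization: queries are sampled around statistics of
# ground-truth trajectory embeddings from the training corpus (simulated here)
gt_embed_mean = rng.normal(size=d_model)
gt_embed_std = np.abs(rng.normal(size=d_model))
queries = rng.normal(gt_embed_mean, gt_embed_std, size=(num_queries, d_model))

# Reasoning-enhanced vision-language features (simulated token sequence)
vl_tokens = rng.normal(size=(20, d_model))

def cross_attention(q, kv):
    """Single-head scaled dot-product cross-attention (numerically stable)."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv

# Queries attend to the VL features, then a linear head decodes every
# continuous trajectory in a single parallel pass (no token-by-token loop)
attended = cross_attention(queries, vl_tokens)
W_out = rng.normal(size=(d_model, horizon * action_dim)) * 0.1
trajectories = (attended @ W_out).reshape(num_queries, horizon, action_dim)
print(trajectories.shape)  # (6, 8, 2): all candidate trajectories at once
```

The parallel, query-based decoding is what would give such a model its latency advantage over autoregressive token generation: the trajectory set is produced in a single forward pass.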