🤖 AI Summary
Existing vision-language-action (VLA) models for autonomous driving suffer from low inference efficiency and poor generalization across diverse vehicle configurations and driving scenarios. To address these limitations, this paper proposes an efficient and generalizable VLA framework featuring: (1) learnable action queries integrated with Gaussian-initialized trajectory sampling and chain-of-thought–enhanced multimodal feature modeling; and (2) a unified multi-dataset training paradigm that jointly leverages vision-language pretraining, supervised fine-tuning, and reinforcement learning. Extensive experiments demonstrate that our approach achieves state-of-the-art performance across multiple autonomous driving benchmarks—including nuPlan, Waymo Open Motion, and nuScenes—while substantially reducing inference latency. Moreover, it exhibits strong cross-domain generalization, enabling real-time, continuous trajectory generation under complex, dynamic traffic conditions.
📝 Abstract
Vision-Language-Action (VLA) models have recently shown strong decision-making capabilities in autonomous driving. However, existing VLAs often struggle to achieve efficient inference and to generalize to novel autonomous vehicle configurations and driving scenarios. In this paper, we propose Reasoning-VLA, a general and fast action-generation VLA framework. The proposed model employs a set of learnable action queries, initialized via Gaussian sampling from ground-truth trajectories within the training corpus. These learnable queries interact with reasoning-enhanced vision-language features to generate continuous action trajectories in parallel. To promote robust generalization, we consolidate eight publicly available autonomous driving datasets into a standardized, Chain-of-Thought reasoning-based, and easy-to-use data format for model training. Trained with both supervised learning and reinforcement-learning fine-tuning, Reasoning-VLA achieves state-of-the-art performance, superior generalization capability, and the fastest inference speed reported to date in extensive empirical evaluations across multiple benchmarks.
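The core decoding idea described above — Gaussian-initialized learnable action queries attending to reasoning-enhanced vision-language features, then emitting all continuous trajectories in one parallel pass rather than autoregressively — can be illustrated with a minimal NumPy sketch. All dimensions, the single-head attention, and the linear output head here are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative, not from the paper)
num_queries, d_model = 6, 32   # parallel learnable action queries
horizon, action_dim = 8, 2     # waypoints per trajectory, (x, y) per waypoint

# Gaussian initialization: queries are sampled around statistics of
# ground-truth trajectory embeddings from the training corpus (simulated here)
gt_embed_mean = rng.normal(size=d_model)
gt_embed_std = np.abs(rng.normal(size=d_model))
queries = rng.normal(gt_embed_mean, gt_embed_std, size=(num_queries, d_model))

# Reasoning-enhanced vision-language features (simulated token sequence)
vl_tokens = rng.normal(size=(20, d_model))

def cross_attention(q, kv):
    """Single-head scaled dot-product cross-attention (numerically stable)."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv

# Queries attend to the VL features, then a linear head decodes every
# continuous trajectory in a single parallel pass (no token-by-token loop)
attended = cross_attention(queries, vl_tokens)
W_out = rng.normal(size=(d_model, horizon * action_dim)) * 0.1
trajectories = (attended @ W_out).reshape(num_queries, horizon, action_dim)
print(trajectories.shape)  # (6, 8, 2): all candidate trajectories at once
```

The parallel, query-based decoding is what would give such a model its latency advantage over autoregressive token generation: the trajectory set is produced in a single forward pass.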