AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

📅 2025-06-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing end-to-end Vision-Language-Action (VLA) models for autonomous driving tend to produce physically infeasible actions, rely on complex model structures, or reason at unnecessary length. This paper proposes AutoVLA, a single autoregressive VLA model that unifies semantic reasoning and trajectory planning, generating physically feasible trajectories directly from raw visual inputs and language instructions. Its main components are: tokenization of continuous trajectories into discrete, feasible action tokens that integrate directly into the language model; supervised fine-tuning (SFT) that teaches two thinking modes, fast (trajectory-only) and slow (with chain-of-thought reasoning); and reinforcement fine-tuning based on Group Relative Policy Optimization (GRPO), which reduces unnecessary reasoning in straightforward scenarios. Evaluated on nuPlan, nuScenes, Waymo, and CARLA, the model shows competitive performance in both open-loop and closed-loop settings.

📝 Abstract

Recent advancements in Vision-Language-Action (VLA) models have shown promise for end-to-end autonomous driving by leveraging world knowledge and reasoning capabilities. However, current VLA models often struggle with physically infeasible action outputs, complex model structures, or unnecessarily long reasoning. In this paper, we propose AutoVLA, a novel VLA model that unifies reasoning and action generation within a single autoregressive generation model for end-to-end autonomous driving. AutoVLA performs semantic reasoning and trajectory planning directly from raw visual inputs and language instructions. We tokenize continuous trajectories into discrete, feasible actions, enabling direct integration into the language model. For training, we employ supervised fine-tuning to equip the model with dual thinking modes: fast thinking (trajectory-only) and slow thinking (enhanced with chain-of-thought reasoning). To further enhance planning performance and efficiency, we introduce a reinforcement fine-tuning method based on Group Relative Policy Optimization (GRPO), reducing unnecessary reasoning in straightforward scenarios. Extensive experiments across real-world and simulated datasets and benchmarks, including nuPlan, nuScenes, Waymo, and CARLA, demonstrate the competitive performance of AutoVLA in both open-loop and closed-loop settings. Qualitative results showcase the adaptive reasoning and accurate planning capabilities of AutoVLA in diverse scenarios.
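The trajectory tokenization mentioned in the abstract can be illustrated with a minimal sketch. It assumes a tiny hand-written codebook of per-step (dx, dy) displacements and nearest-neighbor matching; the codebook values, the function names `tokenize` and `detokenize`, and the matching rule are illustrative assumptions, not the paper's actual action vocabulary or procedure.

```python
import math

# Hypothetical codebook: token id -> (dx, dy) displacement in meters per time step.
# These values are illustrative assumptions, not taken from the paper.
CODEBOOK = [
    (0.0, 0.0),   # 0: stay in place
    (1.0, 0.0),   # 1: slow forward
    (2.0, 0.0),   # 2: fast forward
    (1.0, 0.5),   # 3: forward, drifting left
    (1.0, -0.5),  # 4: forward, drifting right
]


def tokenize(waypoints):
    """Map (x, y) waypoints, relative to a start at the origin, to token ids."""
    tokens = []
    x, y = 0.0, 0.0  # track the *reconstructed* position so errors do not accumulate
    for wx, wy in waypoints:
        dx, dy = wx - x, wy - y
        # Pick the codebook entry closest to the displacement still needed.
        best = min(
            range(len(CODEBOOK)),
            key=lambda i: math.hypot(CODEBOOK[i][0] - dx, CODEBOOK[i][1] - dy),
        )
        tokens.append(best)
        x += CODEBOOK[best][0]
        y += CODEBOOK[best][1]
    return tokens


def detokenize(tokens):
    """Accumulate codebook displacements back into (x, y) waypoints."""
    x, y = 0.0, 0.0
    waypoints = []
    for t in tokens:
        dx, dy = CODEBOOK[t]
        x += dx
        y += dy
        waypoints.append((x, y))
    return waypoints


tokens = tokenize([(1.0, 0.0), (3.0, 0.1), (4.0, 0.6)])
decoded = detokenize(tokens)
```

In a scheme like this, every decoded step is a codebook entry, so whether the output is physically plausible depends on how the codebook is built; a language model can then predict the token ids like any other vocabulary items.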

Problem

Research questions and friction points this paper is trying to address.

Addresses physically infeasible actions in VLA models

Simplifies complex model structures for autonomous driving

Reduces unnecessarily long reasoning in trajectory planning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified autoregressive model for reasoning and action

Tokenizes continuous trajectories into discrete, feasible actions

Reinforcement fine-tuning with GRPO for efficiency
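As a rough illustration of the GRPO item above, the sketch below computes group-relative advantages: several outputs are sampled for the same scenario, and each output's reward is normalized by the group's mean and standard deviation. The reward shaping in `shaped_reward` (a driving score minus a penalty on reasoning length) and its `length_weight` parameter are assumptions made for this example; the page does not specify the paper's reward function.

```python
from statistics import mean, pstdev


def shaped_reward(driving_score, num_reasoning_tokens, length_weight=0.001):
    """Hypothetical reward: driving quality minus a penalty on reasoning length."""
    return driving_score - length_weight * num_reasoning_tokens


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled outputs for one scenario: (driving score, reasoning tokens used).
samples = [(0.9, 0), (0.9, 200), (0.5, 50)]
rewards = [shaped_reward(score, n) for score, n in samples]
advantages = group_relative_advantages(rewards)
```

Under this assumed shaping, two outputs that drive equally well receive different advantages when one of them reasons longer, which is one way a policy could be pushed toward shorter reasoning in simple scenarios.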
