Devil is in Narrow Policy: Unleashing Exploration in Driving VLA Models

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited exploration of vision-language-action (VLA) models for autonomous driving during imitation learning, which stems from narrow policy distributions and subsequently hinders reinforcement-learning performance. To overcome this, the authors propose Curious-VLA, a framework that enhances exploration diversity during imitation learning through Feasible Trajectory Expansion (FTE) and normalized trajectory representations. In the reinforcement-learning phase, they introduce Adaptive Diversity-Aware Sampling (ADAS) and a Spanning Driving Reward (SDR) to improve sensitivity to driving quality. This approach effectively mitigates the exploration–exploitation trade-off and achieves state-of-the-art performance on the Navsim benchmark, with a PDMS of 90.3 and an EPDMS of 85.4; notably, its Best-of-N PDMS reaches 94.8.
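The diversity-prioritized sampling idea in ADAS can be sketched roughly as follows. This is a minimal illustration under assumptions of our own: the diversity measure (mean pairwise waypoint distance), the trajectory shape `(N, T, 2)`, and all function names are hypothetical, not the paper's actual implementation.

```python
import numpy as np

def trajectory_diversity(trajs):
    """Score each candidate trajectory by its mean L2 distance to the others.

    trajs: array of shape (N, T, 2) -- N candidate trajectories,
    each with T (x, y) waypoints. Higher score = more distinct candidate.
    """
    n = len(trajs)
    flat = trajs.reshape(n, -1)
    # Pairwise distance matrix between flattened trajectories.
    dists = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)
    return dists.sum(axis=1) / (n - 1)

def adaptive_diversity_sampling(trajs, k, rng=None):
    """Draw k distinct trajectories with probability proportional to diversity."""
    rng = np.random.default_rng(rng)
    scores = trajectory_diversity(trajs)
    probs = scores / scores.sum()
    return rng.choice(len(trajs), size=k, replace=False, p=probs)
```

With this scoring, an outlier trajectory far from the rest of the candidate set receives the highest diversity score and is therefore sampled most often, which matches the stated goal of feeding higher-diversity samples to the RL stage.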

📝 Abstract
We identify a fundamental Narrow Policy limitation undermining the performance of autonomous-driving VLA models: Imitation Learning (IL) tends to collapse exploration and limit the potential of subsequent Reinforcement Learning (RL) stages, which often saturate prematurely due to insufficient feedback diversity. To this end, we propose Curious-VLA, a framework that alleviates the exploration-exploitation dilemma through a two-stage design. During IL, we introduce a Feasible Trajectory Expansion (FTE) strategy to generate multiple physically valid trajectories and a step-wise normalized trajectory representation to accommodate this diverse data. In the RL stage, we present Adaptive Diversity-Aware Sampling (ADAS), which prioritizes high-diversity samples, and introduce a Spanning Driving Reward (SDR) with focal-style weighting that amplifies the reward's value span to improve sensitivity to driving quality. On the Navsim benchmark, Curious-VLA achieves SoTA results (PDMS 90.3, EPDMS 85.4) and a Best-of-N PDMS of 94.8, demonstrating its effectiveness in unlocking the exploratory potential of VLA models. Code: https://github.com/Mashiroln/curious_vla.git.
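The "focal style weighting" that widens the reward's value span can be sketched as follows. The abstract does not give the SDR formula, so this power transform is purely an illustrative assumption of how a span-amplifying reward might behave.

```python
def spanning_reward(r, gamma=2.0):
    """Illustrative focal-style transform of a driving score r in [0, 1].

    A power transform with gamma > 1 suppresses mediocre scores much more
    than near-perfect ones, stretching the reward gap between "okay" and
    "good" driving so the RL signal is more sensitive to quality.
    This specific formula is an assumption, not the paper's SDR.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return r ** gamma
```

For `gamma=2`, scores 0.5 and 0.9 map to 0.25 and 0.81, widening their gap from 0.40 to 0.56 while leaving a perfect score of 1.0 unchanged.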
Problem

Research questions and friction points this paper is trying to address.

Narrow Policy
Exploration
Imitation Learning
Reinforcement Learning
VLA Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feasible Trajectory Expansion
Adaptive Diversity-Aware Sampling
Spanning Driving Reward
Vision-Language-Action Models
Exploration-Exploitation Trade-off
Canyu Chen
CS Ph.D. at Northwestern | Visiting Researcher at UC Berkeley
Foundation Agent, Trustworthiness, Multimodality

Yuguang Yang
Microsoft, Amazon Alexa AI, Tsinghua University, Johns Hopkins University
Artificial Intelligence, Natural Language Processing, Stochastic Process & Control, Computational Physics

Zhewen Tan
School of Computer Science and Engineering, Beihang University

Yizhi Wang
School of Cyber Science and Technology, Beihang University

Ruiyi Zhan
School of Computer Science and Engineering, Beihang University

Haiyan Liu
Lenovo Group Limited

Xuanyao Mao
Lenovo Group Limited

Jason Bao
Lenovo Group Limited

Xinyue Tang
Lenovo Group Limited

Linlin Yang
Communication University of China
Computer Vision, Machine Learning

Bingchuan Sun
Lenovo Group Limited

Yan Wang
Tsinghua University; SenseTime
Neural Compression, Computer Vision, Machine Learning

Baochang Zhang
Technische Universität München
Computer Assisted Intervention, Medical Image Analysis, Deep Learning