Devil is in Narrow Policy: Unleashing Exploration in Driving VLA Models

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited exploration of vision-language-action (VLA) models for autonomous driving during imitation learning, which stems from narrow policy distributions and subsequently hinders reinforcement-learning performance. To overcome this, the authors propose Curious-VLA, a framework that enhances exploration diversity during imitation learning through Feasible Trajectory Expansion (FTE) and normalized trajectory representations. In the reinforcement-learning phase, they introduce Adaptive Diversity-Aware Sampling (ADAS) and a Spanning Driving Reward (SDR) to improve sensitivity to driving quality. This approach effectively mitigates the exploration–exploitation trade-off and achieves state-of-the-art performance on the Navsim benchmark, with a PDMS of 90.3 and an EPDMS of 85.4; notably, its Best-of-N PDMS reaches 94.8.
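The diversity-prioritized sampling idea in ADAS can be sketched roughly as follows. This is a minimal illustration under assumptions of our own: the diversity measure (mean pairwise waypoint distance), the trajectory shape `(N, T, 2)`, and all function names are hypothetical, not the paper's actual implementation.

```python
import numpy as np

def trajectory_diversity(trajs):
    """Score each candidate trajectory by its mean L2 distance to the others.

    trajs: array of shape (N, T, 2) -- N candidate trajectories,
    each with T (x, y) waypoints. Higher score = more distinct candidate.
    """
    n = len(trajs)
    flat = trajs.reshape(n, -1)
    # Pairwise distance matrix between flattened trajectories.
    dists = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)
    return dists.sum(axis=1) / (n - 1)

def adaptive_diversity_sampling(trajs, k, rng=None):
    """Draw k distinct trajectories with probability proportional to diversity."""
    rng = np.random.default_rng(rng)
    scores = trajectory_diversity(trajs)
    probs = scores / scores.sum()
    return rng.choice(len(trajs), size=k, replace=False, p=probs)
```

With this scoring, an outlier trajectory far from the rest of the candidate set receives the highest diversity score and is therefore sampled most often, which matches the stated goal of feeding higher-diversity samples to the RL stage.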

📝 Abstract
We identify a fundamental Narrow Policy limitation undermining the performance of autonomous-driving VLA models: Imitation Learning (IL) tends to collapse exploration and limit the potential of subsequent Reinforcement Learning (RL) stages, which often saturate prematurely due to insufficient feedback diversity. To this end, we propose Curious-VLA, a framework that alleviates the exploration-exploitation dilemma through a two-stage design. During IL, we introduce a Feasible Trajectory Expansion (FTE) strategy to generate multiple physically valid trajectories and a step-wise normalized trajectory representation to accommodate this diverse data. In the RL stage, we present Adaptive Diversity-Aware Sampling (ADAS), which prioritizes high-diversity samples, and introduce a Spanning Driving Reward (SDR) with focal-style weighting that amplifies the reward's value span to improve sensitivity to driving quality. On the Navsim benchmark, Curious-VLA achieves SoTA results (PDMS 90.3, EPDMS 85.4) and a Best-of-N PDMS of 94.8, demonstrating its effectiveness in unlocking the exploratory potential of VLA models. Code: https://github.com/Mashiroln/curious_vla.git.
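The "focal style weighting" that widens the reward's value span can be sketched as follows. The abstract does not give the SDR formula, so this power transform is purely an illustrative assumption of how a span-amplifying reward might behave.

```python
def spanning_reward(r, gamma=2.0):
    """Illustrative focal-style transform of a driving score r in [0, 1].

    A power transform with gamma > 1 suppresses mediocre scores much more
    than near-perfect ones, stretching the reward gap between "okay" and
    "good" driving so the RL signal is more sensitive to quality.
    This specific formula is an assumption, not the paper's SDR.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return r ** gamma
```

For `gamma=2`, scores 0.5 and 0.9 map to 0.25 and 0.81, widening their gap from 0.40 to 0.56 while leaving a perfect score of 1.0 unchanged.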
Problem

Research questions and friction points this paper is trying to address.

Narrow Policy
Exploration
Imitation Learning
Reinforcement Learning
VLA Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feasible Trajectory Expansion
Adaptive Diversity-Aware Sampling
Spanning Driving Reward
Vision-Language-Action Models
Exploration-Exploitation Trade-off
Canyu Chen
CS Ph.D. at Northwestern | Visiting Researcher at UC Berkeley
Foundation Agent, Trustworthiness, Multimodality

Yuguang Yang
Microsoft, Amazon Alexa AI, Tsinghua University, Johns Hopkins University
Artificial Intelligence, Natural Language Processing, Stochastic Process & Control, Computational Physics

Zhewen Tan
School of Computer Science and Engineering, Beihang University

Yizhi Wang
School of Cyber Science and Technology, Beihang University

Ruiyi Zhan
School of Computer Science and Engineering, Beihang University

Haiyan Liu
Lenovo Group Limited

Xuanyao Mao
Lenovo Group Limited

Jason Bao
Lenovo Group Limited

Xinyue Tang
Lenovo Group Limited

Linlin Yang
Communication University of China
Computer Vision, Machine Learning

Bingchuan Sun
Lenovo Group Limited

Yan Wang
Tsinghua University; SenseTime
Neural Compression, Computer Vision, Machine Learning

Baochang Zhang
Technische Universität München
Computer Assisted Intervention, Medical Image Analysis, Deep Learning