ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving

πŸ“… 2026-04-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the limited generalization of existing end-to-end autonomous driving methods in out-of-distribution scenarios and their lack of effective exploration mechanisms. The authors propose a unified understanding-and-generation framework that constructs a dense world model to predict future RGB and depth images, providing dense supervisory signals for policy learning. Leveraging prediction uncertainty, the approach drives safe exploration by using the world model's output as an intrinsic reward. Built upon a Vision-Language-Action architecture, the method employs Group Relative Policy Optimization for policy refinement. Evaluated on the NAVSIM and nuScenes benchmarks, it achieves state-of-the-art performance on NAVSIM with a PDMS of 93.7 and an EPDMS of 88.8.
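The dense-supervision idea above can be illustrated with a minimal sketch: sparse trajectory imitation loss augmented with dense image-space objectives for predicted future RGB and depth. This is not the paper's actual loss; the function name, weights, and the use of plain mean-squared error are all assumptions for illustration.

```python
import numpy as np

def dense_world_model_loss(pred_traj, gt_traj,
                           pred_rgb, gt_rgb,
                           pred_depth, gt_depth,
                           w_rgb=1.0, w_depth=1.0):
    """Hypothetical combined objective: sparse trajectory supervision
    plus dense world-modeling terms (future RGB and depth prediction),
    folded into a weighted sum of mean-squared errors."""
    l_traj = np.mean((pred_traj - gt_traj) ** 2)    # imitation term
    l_rgb = np.mean((pred_rgb - gt_rgb) ** 2)       # future RGB reconstruction
    l_depth = np.mean((pred_depth - gt_depth) ** 2) # future depth reconstruction
    return l_traj + w_rgb * l_rgb + w_depth * l_depth
```

The dense terms give the planning backbone a per-pixel training signal at every step, rather than only a handful of waypoint errors per scene.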
πŸ“ Abstract
End-to-end autonomous driving models based on Vision-Language-Action (VLA) architectures have shown promising results by learning driving policies through behavior cloning on expert demonstrations. However, imitation learning inherently limits the model to replicating observed behaviors without exploring diverse driving strategies, leaving it brittle in novel or out-of-distribution scenarios. Reinforcement learning (RL) offers a natural remedy by enabling policy exploration beyond the expert distribution. Yet VLA models, typically trained on offline datasets, lack directly observable state transitions, necessitating a learned world model to anticipate action consequences. In this work, we propose a unified understanding-and-generation framework that leverages world modeling to simultaneously enable meaningful exploration and provide dense supervision. Specifically, we augment trajectory prediction with future RGB and depth image generation as dense world modeling objectives, requiring the model to learn fine-grained visual and geometric representations that substantially enrich the planning backbone. Beyond serving as a supervisory signal, the world model further acts as a source of intrinsic reward for policy exploration: its image prediction uncertainty naturally measures a trajectory's novelty relative to the training distribution, where high uncertainty indicates out-of-distribution scenarios that, if safe, represent valuable learning opportunities. We incorporate this exploration signal into a safety-gated reward and optimize the policy via Group Relative Policy Optimization (GRPO). Experiments on the NAVSIM and nuScenes benchmarks demonstrate the effectiveness of our approach, achieving a state-of-the-art PDMS score of 93.7 and an EPDMS of 88.8 on NAVSIM. The code and demo will be publicly available at https://zihaosheng.github.io/ExploreVLA/.
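The exploration mechanism the abstract describes can be sketched as two pieces: an intrinsic reward that scales with the world model's prediction uncertainty but is gated to zero for unsafe trajectories, and GRPO-style group-relative advantages computed over a batch of sampled rollouts. This is a minimal sketch under stated assumptions — the function names, the use of raw prediction error as the uncertainty proxy, and the standardization details are illustrative, not the paper's exact formulation.

```python
import numpy as np

def intrinsic_reward(pred_uncertainty, is_safe, scale=1.0):
    """Safety-gated intrinsic reward (hypothetical form): high world-model
    uncertainty marks a trajectory as novel relative to the training
    distribution, but the bonus is granted only when the trajectory is safe."""
    return scale * pred_uncertainty if is_safe else 0.0

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages in the GRPO style: standardize each
    rollout's reward against the mean and std of its sampled group,
    removing the need for a learned value baseline."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

A usage example: for a group of four rollouts with safety-gated rewards `[0.0, 0.2, 0.5, 0.1]`, `grpo_advantages` yields positive advantages only for the above-average (safe, novel) trajectories, which is what steers the policy toward valuable out-of-distribution behavior.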
Problem

Research questions and friction points this paper is trying to address.

autonomous driving
exploration
out-of-distribution
imitation learning
world modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

dense world modeling
Vision-Language-Action (VLA)
intrinsic reward
policy exploration
future image generation
πŸ”Ž Similar Papers
No similar papers found.