🤖 AI Summary
Existing end-to-end autonomous driving vision-language-action (VLA) models face two key bottlenecks: the poor generalization of open-loop imitation learning, and the high computational cost of closed-loop reinforcement learning (RL) together with its reliance on high-fidelity simulation. This paper proposes a closed-loop RL framework integrating inverse reinforcement learning (IRL) with a reward-based world model. First, we introduce a lightweight reward world model that enables low-cost, simulation-free closed-loop training. Second, we employ IRL to implicitly infer a multi-objective reward function—encompassing safety, comfort, and efficiency—from expert demonstrations, which guides proximal policy optimization (PPO) for policy refinement. Our approach synergistically combines a VLA architecture, imitation pretraining, IRL-based reward modeling, and closed-loop RL fine-tuning. Evaluated on the NAVSIM v2 benchmark, it achieves state-of-the-art performance and secured second place in the CVPR 2025 Autonomous Grand Challenge.
📝 Abstract
Vision-Language-Action (VLA) models have demonstrated potential in autonomous driving. However, two critical challenges hinder their development: (1) existing VLA architectures are typically trained by imitation learning in an open-loop setup, which tends to merely reproduce the behaviors recorded in the dataset, leading to suboptimal and constrained performance; (2) closed-loop training relies heavily on high-fidelity sensor simulation, where domain gaps and computational inefficiencies pose significant barriers. In this paper, we introduce IRL-VLA, a novel closed-loop reinforcement learning framework built on an **I**nverse **R**einforcement **L**earning reward world model with a self-built VLA policy. Our framework proceeds in a three-stage paradigm: in the first stage, we propose a VLA architecture and pretrain the VLA policy via imitation learning. In the second stage, we construct a lightweight reward world model via inverse reinforcement learning to enable efficient closed-loop reward computation. Finally, to further enhance planning performance, we design reward-world-model-guided reinforcement learning via PPO (Proximal Policy Optimization) to effectively balance safety, driving comfort, and traffic efficiency. Our approach achieves state-of-the-art performance on the NAVSIM v2 end-to-end driving benchmark and is the 1st runner-up in the CVPR 2025 Autonomous Grand Challenge. We hope that our framework will accelerate VLA research in closed-loop autonomous driving.
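The two core ingredients of the third stage—a multi-objective reward scored by a learned world model and PPO's clipped policy update—can be sketched minimally. The reward terms, thresholds, weights, and trajectory fields below are illustrative assumptions for exposition, not the paper's actual reward world model:

```python
import numpy as np

def reward_world_model(traj, w_safety=1.0, w_comfort=0.5, w_efficiency=0.5):
    """Hypothetical multi-objective trajectory reward.

    Combines safety (no near-collisions), comfort (bounded acceleration),
    and efficiency (route progress) into a weighted scalar, standing in
    for the learned reward world model. All terms lie in [0, 1].
    """
    safety = float(np.min(traj["clearance"]) > 0.5)        # min clearance in meters
    comfort = float(np.max(np.abs(traj["accel"])) < 3.0)   # peak |accel| in m/s^2
    efficiency = min(traj["progress"] / traj["route_len"], 1.0)
    return w_safety * safety + w_comfort * comfort + w_efficiency * efficiency

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A).

    `ratio` is pi_new(a|s) / pi_old(a|s); advantages here would be computed
    from the reward world model's scores rather than a simulator."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)
```

For example, a trajectory that keeps clearance, stays smooth, and covers half its route scores `1.0 + 0.5 + 0.25 = 1.75` under the default weights, and the clipped objective caps the incentive for large policy updates (`ppo_clip_objective(1.5, 1.0)` is limited to `1.2`).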