IRL-VLA: Training a Vision-Language-Action Policy via Reward World Model

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing end-to-end autonomous driving vision-language-action (VLA) models face two key bottlenecks: the poor generalization of open-loop imitation learning, and the high computational cost of closed-loop reinforcement learning (RL) together with its reliance on high-fidelity simulation. This paper proposes a closed-loop RL framework integrating inverse reinforcement learning (IRL) with a reward-based world model. First, it introduces a lightweight reward world model enabling low-cost, simulation-free closed-loop training. Second, it employs IRL to implicitly infer a multi-objective reward function—encompassing safety, comfort, and efficiency—from expert demonstrations, which guides proximal policy optimization (PPO) for policy refinement. The approach synergistically combines a VLA architecture, imitation pretraining, IRL-based reward modeling, and closed-loop RL fine-tuning. Evaluated on the NAVSIM v2 benchmark, it achieved state-of-the-art performance and secured second place in the CVPR 2025 Autonomous Driving Grand Challenge.

📝 Abstract
Vision-Language-Action (VLA) models have demonstrated potential in autonomous driving. However, two critical challenges hinder their development: (1) existing VLA architectures are typically based on imitation learning in an open-loop setup, which tends to reproduce the recorded behaviors in the dataset, leading to suboptimal and constrained performance; (2) closed-loop training relies heavily on high-fidelity sensor simulation, where domain gaps and computational inefficiency pose significant barriers. In this paper, we introduce IRL-VLA, a novel closed-loop reinforcement learning framework built on an Inverse Reinforcement Learning (IRL) reward world model with a self-built VLA approach. Our framework proceeds in a three-stage paradigm: in the first stage, we propose a VLA architecture and pretrain the VLA policy via imitation learning. In the second stage, we construct a lightweight reward world model via inverse reinforcement learning to enable efficient closed-loop reward computation. Finally, to further enhance planning performance, we design reward-world-model-guided reinforcement learning via PPO (Proximal Policy Optimization) to effectively balance safety incidents, driving comfort, and traffic efficiency. Our approach achieves state-of-the-art performance on the NAVSIM v2 end-to-end driving benchmark and placed first runner-up in the CVPR 2025 Autonomous Grand Challenge. We hope that our framework will accelerate VLA research in closed-loop autonomous driving.
Problem

Research questions and friction points this paper is trying to address.

Improving Vision-Language-Action models for autonomous driving
Overcoming imitation learning limitations in open-loop setups
Reducing reliance on high-fidelity sensor simulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLA policy pretrained via imitation learning
Lightweight reward world model via IRL
PPO-based reinforcement learning for driving
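The key idea behind these contributions — replacing a simulator with a learned, multi-objective reward model that guides PPO — can be sketched minimally in Python. This is an illustrative toy, not the paper's implementation: the hand-set objective weights stand in for what IRL-VLA infers implicitly from expert demonstrations, and the trajectory features (`dist_to_obstacle`, `accel`, `progress`) are hypothetical.

```python
import math

def reward_world_model(trajectory):
    """Hypothetical reward world model scoring a planned trajectory on the
    three objectives the paper names: safety, comfort, and efficiency.
    The weights below are illustrative placeholders; in IRL-VLA they would
    be recovered implicitly from expert demonstrations via IRL."""
    safety = -sum(1.0 for p in trajectory if p["dist_to_obstacle"] < 2.0)  # penalize near-collisions
    comfort = -sum(abs(p["accel"]) for p in trajectory) / len(trajectory)  # penalize harsh acceleration
    efficiency = trajectory[-1]["progress"]                                # reward progress along route
    return 1.0 * safety + 0.5 * comfort + 0.8 * efficiency

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for one action. The advantage would be
    estimated from the learned reward model's scores, so no simulator
    rollout is needed — the 'simulation-free closed-loop' idea."""
    ratio = math.exp(log_prob_new - log_prob_old)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    return -min(ratio * advantage, clipped * advantage)

# Toy two-step trajectory: safe distances, mild acceleration, 10 m progress.
traj = [{"dist_to_obstacle": 5.0, "accel": 0.5, "progress": 0.0},
        {"dist_to_obstacle": 4.0, "accel": 0.3, "progress": 10.0}]
score = reward_world_model(traj)
loss = ppo_clip_loss(log_prob_new=0.0, log_prob_old=0.0, advantage=1.0)
```

The design point is that the policy never queries a sensor simulator during fine-tuning: the reward model scores whole trajectories directly, which is what makes closed-loop training lightweight.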
A
Anqing Jiang
Bosch Corporate Research, Shanghai, China
Y
Yu Gao
Bosch Corporate Research, Shanghai, China
Y
Yiru Wang
Z
Zhigang Sun
Bosch Corporate Research, Shanghai, China
S
Shuo Wang
Bosch Corporate Research, Shanghai, China
Y
Yuwen Heng
Bosch Corporate Research, Shanghai, China
H
Hao Sun
Bosch Corporate Research, Shanghai, China
S
Shichen Tang
Bosch Corporate Research, Shanghai, China
L
Lijuan Zhu
Bosch Corporate Research, Shanghai, China
J
Jinhao Chai
School of Communication and Information Engineering, Shanghai University
J
Jijun Wang
AIR, Tsinghua University, Beijing
Z
Zichong Gu
School of Communication and Information Engineering, Shanghai University
H
Hao Jiang
School of Mechanical Engineering, Shanghai Jiao Tong University
L
Li Sun
Bosch Mobility Solutions, Robert Bosch GmbH, Suzhou