WoVR: World Models as Reliable Simulators for Post-Training VLA Policies with RL

📅 2026-02-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of applying reinforcement learning (RL) to vision-language-action (VLA) models, where high real-world robot interaction costs and world model limitations—such as hallucination and long-horizon error accumulation—distort policy optimization signals. To mitigate these issues, the authors propose WoVR, a novel framework that explicitly controls the impact of world model hallucinations on RL. WoVR leverages action-conditioned video generation and keyframe-initialized rollouts to reduce effective error depth and introduces a co-evolution mechanism between the world model and policy to maintain dynamic alignment. This approach enables stable and efficient post-training of VLA policies, achieving significant performance gains: average success rates on the LIBERO benchmark improve from 39.95% to 69.2%, and real-robot task success rises from 61.7% to 91.7%.

📝 Abstract
Reinforcement learning (RL) promises to unlock capabilities beyond imitation learning for Vision-Language-Action (VLA) models, but its requirement for massive real-world interaction prevents direct deployment on physical robots. Recent work attempts to use learned world models as simulators for policy optimization, yet closed-loop imagined rollouts inevitably suffer from hallucination and long-horizon error accumulation. Such errors do not merely degrade visual fidelity; they corrupt the optimization signal, encouraging policies to exploit model inaccuracies rather than genuine task progress. We propose WoVR, a reliable world-model-based reinforcement learning framework for post-training VLA policies. Instead of assuming a faithful world model, WoVR explicitly regulates how RL interacts with imperfect imagined dynamics. It improves rollout stability through a controllable action-conditioned video world model, reshapes imagined interaction to reduce effective error depth via Keyframe-Initialized Rollouts, and maintains policy-simulator alignment through World Model-Policy co-evolution. Extensive experiments on LIBERO benchmarks and real-world robotic manipulation demonstrate that WoVR enables stable long-horizon imagined rollouts and effective policy optimization, improving average LIBERO success from 39.95% to 69.2% (+29.3 points) and real-robot success from 61.7% to 91.7% (+30.0 points). These results show that learned world models can serve as practical simulators for reinforcement learning when hallucination is explicitly controlled.
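The Keyframe-Initialized Rollouts idea from the abstract can be illustrated with a toy sketch: instead of imagining one long closed-loop trajectory from a single start state, where world-model error compounds over the full horizon, the rollout is restarted from real keyframe observations every few steps, so error only accumulates within a segment. This is a minimal illustration, not the paper's implementation; the function name and the toy world model and policy below are hypothetical.

```python
def keyframe_rollout(world_model, policy, keyframes, segment_len):
    """Collect imagined transitions in short segments, each re-initialized
    from a ground-truth keyframe. Error depth is bounded by segment_len
    instead of len(keyframes) * segment_len."""
    trajectory = []
    for kf in keyframes:
        obs = kf  # reset the imagined state to a real keyframe observation
        for _ in range(segment_len):
            action = policy(obs)
            obs = world_model(obs, action)  # one imagined dynamics step
            trajectory.append((obs, action))
    return trajectory

# Toy stand-ins: a linear "world model" and a proportional "policy".
toy_wm = lambda obs, act: obs + 0.1 * act
toy_policy = lambda obs: -obs
traj = keyframe_rollout(toy_wm, toy_policy, keyframes=[0.0, 1.0, 2.0], segment_len=4)
```

With k keyframes and segment length n, the policy still sees k x n imagined transitions, but each segment's compounding error is reset at a real observation, which is the "reduced effective error depth" the abstract refers to.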
Problem

Research questions and friction points this paper is trying to address.

world models
reinforcement learning
Vision-Language-Action
hallucination
sim-to-real
Innovation

Methods, ideas, or system contributions that make the work stand out.

World Models
Reinforcement Learning
Vision-Language-Action
Keyframe-Initialized Rollouts
Policy-Simulator Co-evolution
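The "Policy-Simulator Co-evolution" item above can be sketched as an alternating optimization loop: the world model is periodically refit on data from the current policy so the simulator stays aligned with the state distribution the policy actually induces. The update rules are not specified in this summary, so `update_wm` and `update_policy` below are hypothetical placeholders standing in for world-model fitting and RL policy improvement.

```python
def co_evolve(world_model, policy, real_buffer, update_wm, update_policy, n_rounds):
    """Alternate world-model refitting and policy improvement so that the
    learned simulator tracks the current policy's visitation distribution."""
    for _ in range(n_rounds):
        # Refit the world model on (policy-induced) real data.
        world_model = update_wm(world_model, real_buffer, policy)
        # Improve the policy with RL inside the refreshed world model.
        policy = update_policy(policy, world_model)
    return world_model, policy

# Toy demo with integers standing in for model/policy parameters.
final_wm, final_policy = co_evolve(
    world_model=0, policy=0, real_buffer=None,
    update_wm=lambda wm, buf, pi: wm + 1,
    update_policy=lambda pi, wm: pi + wm,
    n_rounds=3,
)
```

The alternation matters: a policy trained against a frozen imperfect model learns to exploit its inaccuracies, whereas interleaved refitting keeps the optimization signal anchored to the real dynamics.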