🤖 AI Summary
This work addresses the challenge of applying reinforcement learning (RL) to vision-language-action (VLA) models: real-world robot interaction is too costly for direct RL, and the learned world models used as substitute simulators suffer from hallucination and long-horizon error accumulation, which distort the policy optimization signal. To mitigate these issues, the authors propose WoVR, a framework that explicitly regulates how RL interacts with an imperfect world model. WoVR uses a controllable action-conditioned video world model to stabilize rollouts, Keyframe-Initialized Rollouts to reduce effective error depth, and co-evolution of the world model and policy to keep the two aligned. This enables stable, effective post-training of VLA policies: average success on the LIBERO benchmark improves from 39.95% to 69.2%, and real-robot task success rises from 61.7% to 91.7%.
📝 Abstract
Reinforcement learning (RL) promises to unlock capabilities beyond imitation learning for Vision-Language-Action (VLA) models, but its requirement for massive real-world interaction prevents direct deployment on physical robots. Recent work attempts to use learned world models as simulators for policy optimization, yet closed-loop imagined rollouts inevitably suffer from hallucination and long-horizon error accumulation. Such errors do not merely degrade visual fidelity; they corrupt the optimization signal, encouraging policies to exploit model inaccuracies rather than genuine task progress. We propose WoVR, a reliable world-model-based reinforcement learning framework for post-training VLA policies. Instead of assuming a faithful world model, WoVR explicitly regulates how RL interacts with imperfect imagined dynamics. It improves rollout stability through a controllable action-conditioned video world model, reshapes imagined interaction to reduce effective error depth via Keyframe-Initialized Rollouts, and maintains policy-simulator alignment through World Model-Policy co-evolution. Extensive experiments on LIBERO benchmarks and real-world robotic manipulation demonstrate that WoVR enables stable long-horizon imagined rollouts and effective policy optimization, improving average LIBERO success from 39.95% to 69.2% (+29.3 points) and real-robot success from 61.7% to 91.7% (+30.0 points). These results show that learned world models can serve as practical simulators for reinforcement learning when hallucination is explicitly controlled.
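The notion of "effective error depth" can be illustrated with a toy calculation. The sketch below is not WoVR's implementation. It assumes, for illustration only, that a keyframe-initialized rollout starts from a real recorded frame partway through a task, so fewer model-predicted steps are chained, and that model error compounds multiplicatively per imagined step. The function names, the horizon of 100 steps, and the 1% per-step error are all hypothetical.

```python
def imagined_depth(task_horizon: int, start_step: int) -> int:
    """Number of model-predicted steps chained when a rollout starts at start_step.

    start_step = 0 models a rollout imagined from the initial frame;
    a larger start_step models initialization from a later (assumed real) keyframe.
    """
    if not 0 <= start_step <= task_horizon:
        raise ValueError("start_step must lie in [0, task_horizon]")
    return task_horizon - start_step


def compounded_drift(depth: int, per_step_error: float) -> float:
    """Toy drift after `depth` chained predictions.

    Assumes each imagined step multiplies accumulated error by
    (1 + per_step_error); this is an illustrative error model only.
    """
    return (1.0 + per_step_error) ** depth - 1.0


def compare_starts(task_horizon: int, start_steps: list[int], per_step_error: float) -> dict[int, float]:
    """Map each candidate start step to its toy drift at the end of the task."""
    return {
        s: compounded_drift(imagined_depth(task_horizon, s), per_step_error)
        for s in start_steps
    }


if __name__ == "__main__":
    # Hypothetical numbers: 100-step task, 1% error per imagined step.
    for start, drift in compare_starts(100, [0, 50, 75], 0.01).items():
        print(f"start at step {start:3d}: depth {100 - start:3d}, toy drift {drift:.3f}")
```

Under these assumptions, starting later in the task shortens the chain of imagined steps and therefore lowers the accumulated drift, which is the intuition behind limiting how deep RL rolls out inside an imperfect model.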