World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the performance degradation of Vision-Language-Action (VLA) models under data scarcity, non-resettable real-world environments, and ambiguous task-completion criteria, this paper proposes a world-model-based reinforcement learning post-training framework. The method constructs a video-driven world simulator that enables low-cost, resettable virtual interaction, and tightly couples it with a Vision-Language Model (VLM) to perform state inference, continuous reward estimation, and action-termination prediction, thereby circumventing the high-risk, high-cost constraints of physical deployment. With only five expert demonstrations per task, the approach significantly improves success rates on complex robotic manipulation tasks, alleviating three key bottlenecks: data inefficiency, safety constraints, and redundant actions. The core innovation is the first integration of a video-based world model with a VLM-guided instant-feedback mechanism for VLA policy optimization, enabling efficient and safe policy refinement within a real-virtual closed loop.

📝 Abstract
Vision-Language-Action (VLA) models trained via imitation learning suffer from significant performance degradation in data-scarce scenarios due to their reliance on large-scale demonstration datasets. Although reinforcement learning (RL)-based post-training has proven effective in addressing data scarcity, its application to VLA models is hindered by the non-resettable nature of real-world environments. This limitation is particularly critical in high-risk domains such as industrial automation, where interactions often induce state changes that are costly or infeasible to revert. Furthermore, existing VLA approaches lack a reliable mechanism for detecting task completion, leading to redundant actions that reduce overall task success rates. To address these challenges, we propose World-Env, an RL-based post-training framework that replaces physical interaction with a low-cost, world model-based virtual simulator. World-Env consists of two key components: (1) a video-based world simulator that generates temporally consistent future visual observations, and (2) a vision-language model (VLM)-guided instant reflector that provides continuous reward signals and predicts action termination. This simulated environment enables VLA models to safely explore and generalize beyond their initial imitation learning distribution. Our method achieves notable performance gains with as few as five expert demonstrations per task. Experiments on complex robotic manipulation tasks demonstrate that World-Env effectively overcomes the data inefficiency, safety constraints, and inefficient execution of conventional VLA models that rely on real-world interaction, offering a practical and scalable solution for post-training in resource-constrained settings.
Problem

Research questions and friction points this paper is trying to address.

Addressing VLA model performance degradation in data-scarce scenarios
Overcoming non-resettable environment limitations in high-risk domains
Solving unreliable task completion detection causing redundant actions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses world model simulator for safe VLA training
Integrates VLM-guided reflector for reward signals
Enables RL post-training with minimal expert demonstrations
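The closed loop described above (world simulator for rollouts, VLM reflector for rewards and termination) can be sketched in miniature. This is a hypothetical illustration, not the paper's implementation: `VideoWorldModel`, `VLMReflector`, and the scalar "observation" are all invented stand-ins for the real video model, VLM, and visual state.

```python
class VideoWorldModel:
    """Stand-in for the video-based world simulator: given the current
    observation and an action, predict the next observation."""
    def reset(self):
        return 0.0  # initial observation (a scalar latent, for illustration)

    def step(self, obs, action):
        # A real model would generate the next video frame; here we just
        # nudge a scalar state toward a goal value of 1.0.
        return obs + 0.1 * action

class VLMReflector:
    """Stand-in for the VLM-guided instant reflector: scores progress
    toward the language-specified goal and predicts termination."""
    def __init__(self, goal=1.0, tol=0.05):
        self.goal, self.tol = goal, tol

    def __call__(self, obs):
        reward = -abs(self.goal - obs)          # continuous reward estimate
        done = abs(self.goal - obs) < self.tol  # action-termination signal
        return reward, done

def rollout(policy, world, reflector, max_steps=50):
    """One virtual episode: no physical robot, no manual environment reset."""
    obs, total = world.reset(), 0.0
    for _ in range(max_steps):
        action = policy(obs)
        obs = world.step(obs, action)
        reward, done = reflector(obs)
        total += reward
        if done:  # the reflector cuts off redundant trailing actions
            break
    return total

# A trivial proportional policy standing in for the VLA model under training.
policy = lambda obs: 1.0 if obs < 1.0 else 0.0
ret = rollout(policy, VideoWorldModel(), VLMReflector())
```

In the actual framework, the returns collected from such virtual rollouts would drive an RL update of the VLA policy; the sketch only shows the environment-side loop that replaces real-world interaction.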
Junjin Xiao
School of Computer Science and Engineering, Sun Yat-sen University, China
Yandan Yang
BIGAI (Beijing Institute for General Artificial Intelligence)
Computer Vision, Generation, Embodied AI
Xinyuan Chang
Xi'an Jiaotong University; Alibaba-Amap
Autonomous Driving, Computer Vision
Ronghan Chen
Amap, Alibaba Group
Feng Xiong
Amap, Alibaba Group
Mu Xu
Amap, Alibaba Group
Wei-Shi Zheng
Professor @ Sun Yat-sen University
Computer Vision, Pattern Recognition, Machine Learning
Qing Zhang
School of Computer Science and Engineering, Sun Yat-sen University, China