VLAW: Iterative Co-Improvement of Vision-Language-Action Policy and World Model

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language-action (VLA) models are hindered in contact-rich tasks by the scarcity of real-world interaction data and insufficient physical fidelity in world models. This work proposes an iterative co-optimization algorithm that, for the first time, enables online co-evolution of VLA policies and an action-conditioned video-generation world model. The approach continuously refines the world model using real robot rollouts and leverages the improved model to generate high-fidelity synthetic data that augments policy training. By overcoming the traditional limitation of world models—namely, their lack of failure examples and fine-grained physical details—the method significantly enhances policy performance. Evaluated on real robots, it achieves an absolute 39.2% improvement in downstream task success rate, with synthetic data alone contributing an 11.6% gain.

📝 Abstract
The goal of this paper is to improve the performance and reliability of vision-language-action (VLA) models through iterative online interaction. Since collecting policy rollouts in the real world is expensive, we investigate whether a learned simulator (specifically, an action-conditioned video-generation model) can be used to generate additional rollout data. Unfortunately, existing world models lack the physical fidelity necessary for policy improvement: they are predominantly trained on demonstration datasets that lack coverage of many different physical interactions (particularly failure cases) and struggle to accurately model small yet critical physical details in contact-rich object manipulation. We propose a simple iterative improvement algorithm that uses real-world rollout data to improve the fidelity of the world model, which can then, in turn, be used to generate supplemental synthetic data for improving the VLA model. In our experiments on a real robot, we use this approach to improve the performance of a state-of-the-art VLA model on multiple downstream tasks. We achieve a 39.2% absolute success rate improvement over the base policy and 11.6% improvement from training with the generated synthetic rollouts. Videos can be found at this anonymous website: https://sites.google.com/view/vla-w
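The co-improvement loop described in the abstract can be sketched schematically. The sketch below is a toy simulation, not the paper's implementation: the `WorldModel`, `VLAPolicy`, and the scalar `fidelity`/`success_rate` update rules are all illustrative assumptions standing in for fine-tuning a video-generation world model and a VLA policy. It only shows the control flow: alternate between refining the world model on real rollouts (including failures) and refining the policy on a mix of real and synthetic rollouts.

```python
import random

class WorldModel:
    """Toy stand-in for an action-conditioned video-generation world model."""
    def __init__(self):
        self.fidelity = 0.5  # illustrative scalar: fraction of physics modeled well

    def finetune(self, real_rollouts):
        # Real rollouts (including failure cases) close part of the fidelity gap.
        coverage = min(len(real_rollouts) / 10.0, 1.0)
        self.fidelity += (1.0 - self.fidelity) * 0.3 * coverage

    def generate_rollouts(self, policy, n):
        # Synthetic rollouts are only as useful as the model is faithful.
        return [{"synthetic": True, "quality": self.fidelity} for _ in range(n)]

class VLAPolicy:
    """Toy stand-in for a vision-language-action policy."""
    def __init__(self):
        self.success_rate = 0.3

    def rollout(self, n):
        # Expensive real-robot rollouts; successes and failures both get logged.
        return [{"success": random.random() < self.success_rate, "quality": 1.0}
                for _ in range(n)]

    def finetune(self, rollouts):
        # Gains scale with the average fidelity of the training rollouts.
        avg_quality = sum(r["quality"] for r in rollouts) / len(rollouts)
        self.success_rate += (1.0 - self.success_rate) * 0.2 * avg_quality

def co_improve(policy, world_model, iterations=3, real_n=10, synth_n=20):
    """One round per iteration: real data -> better world model -> synthetic
    data -> better policy."""
    for _ in range(iterations):
        real = policy.rollout(real_n)                       # collect real rollouts
        world_model.finetune(real)                          # improve world-model fidelity
        synth = world_model.generate_rollouts(policy, synth_n)
        policy.finetune(real + synth)                       # train on mixed data
    return policy, world_model
```

Under these toy dynamics, both the world model's fidelity and the policy's success rate increase monotonically across iterations, mirroring the qualitative co-evolution the paper reports (the actual numbers, e.g. the 39.2% gain, come from real-robot experiments, not from any dynamics like these).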
Problem

Research questions and friction points this paper is trying to address.

vision-language-action
world model
physical fidelity
policy improvement
contact-rich manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language-action
world model
iterative co-improvement
action-conditioned video generation
policy learning