🤖 AI Summary
Existing action-conditioned video prediction models suffer from slow inference and struggle to maintain physical consistency over long horizons, which limits their usefulness for scalable policy training and evaluation in robotics. This work proposes an interactive world model that uses consistency models for both image decoding and latent-space dynamics prediction, enabling fast simulation learned from a moderate-sized robot interaction dataset. The learned models sustain stable, interaction-consistent simulation for more than 10 minutes at 15 FPS on a single RTX 4090 GPU. Imitation policies trained solely on data collected in the world model perform comparably to policies trained on the same amount of real-world data across multiple real-world tasks, and policy performance measured inside the world model correlates strongly with real-world performance.
📝 Abstract
Action-conditioned video prediction models (often referred to as world models) have shown strong potential for robotics applications, but existing approaches are often slow and struggle to capture physically consistent interactions over long horizons, limiting their usefulness for scalable robot policy training and evaluation. We present Interactive World Simulator, a framework for building interactive world models from a moderate-sized robot interaction dataset. Our approach leverages consistency models for both image decoding and latent-space dynamics prediction, enabling fast and stable simulation of physical interactions. In our experiments, the learned world models produce interaction-consistent pixel-level predictions and support stable long-horizon interactions for more than 10 minutes at 15 FPS on a single RTX 4090 GPU. Our framework enables scalable demonstration collection solely within the world models to train state-of-the-art imitation policies. Through extensive real-world evaluation across diverse tasks involving rigid objects, deformable objects, object piles, and their interactions, we find that policies trained on world-model-generated data perform comparably to those trained on the same amount of real-world data. Additionally, we evaluate policies both within the world models and in the real world across diverse tasks, and observe a strong correlation between simulated and real-world performance. Together, these results establish the Interactive World Simulator as a stable and physically consistent surrogate for scalable robotic data generation and faithful, reproducible policy evaluation.
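To make the long-horizon claim concrete, the following is a minimal sketch of an autoregressive latent rollout at the reported rate and duration. It is not the authors' implementation: `predict_next_latent` and `decode` are hypothetical stand-ins for the learned consistency-model dynamics and image decoder, and the latent size, action size, and random weights are invented for illustration. Only the 15 FPS rate and the 10-minute duration come from the abstract.

```python
import math
import random

LATENT_DIM = 4   # hypothetical latent size, chosen only for illustration
ACTION_DIM = 2   # hypothetical action size, chosen only for illustration


def total_steps(fps, minutes):
    """Number of autoregressive steps in a session of the given length."""
    return fps * 60 * minutes


def frame_budget_ms(fps):
    """Wall-clock time available per frame at the target rate."""
    return 1000.0 / fps


def predict_next_latent(latent, action, weights):
    """Placeholder dynamics (assumption): a contracting update that keeps
    every latent value inside (-1, 1). A learned few-step consistency
    sampler would take this function's place."""
    nxt = []
    for i, z in enumerate(latent):
        drive = sum(w * a for w, a in zip(weights[i], action))
        nxt.append(0.9 * z + 0.1 * math.tanh(drive))
    return nxt


def decode(latent):
    """Placeholder decoder (assumption): map each latent value in [-1, 1]
    to a pixel intensity in [0, 255]."""
    return [round(127.5 * (z + 1.0)) for z in latent]


def rollout(initial_latent, actions, weights):
    """Feed each predicted latent back in as the next input and decode
    one frame per action."""
    latent = list(initial_latent)
    frames = []
    for action in actions:
        latent = predict_next_latent(latent, action, weights)
        frames.append(decode(latent))
    return frames


rng = random.Random(0)
weights = [[rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)]
           for _ in range(LATENT_DIM)]
num_steps = total_steps(fps=15, minutes=10)
actions = [[rng.uniform(-1.0, 1.0) for _ in range(ACTION_DIM)]
           for _ in range(num_steps)]
frames = rollout([0.0] * LATENT_DIM, actions, weights)
print(len(frames), round(frame_budget_ms(15), 1))
```

Ten minutes at 15 FPS corresponds to 9,000 consecutive predictions, each produced from the previous prediction rather than from a real observation, with roughly 66.7 ms available per step for dynamics and decoding combined. This is why both low per-step latency and bounded error accumulation matter for the interactive setting the abstract describes.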