Ctrl-World: A Controllable Generative World Model for Robot Manipulation

📅 2025-10-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Evaluating the generalization of general-purpose robotic policies on novel objects and instructions is challenging due to poor interpretability and high real-world deployment costs. To address this, we propose Ctrl-World, a controllable generative world model that integrates pose-conditioned memory retrieval for long-horizon consistency, frame-level action-conditioned generation, and multi-view video prediction, enabling high-fidelity, multi-step, controllable interaction simulation. Trained on the DROID dataset (95k trajectories across 564 scenes), Ctrl-World accurately ranks policy performance without physical execution. Furthermore, supervised fine-tuning on its generated successful trajectories improves policy success rates by 44.7%. To our knowledge, Ctrl-World is the first world model to enable coherent, controllable, and fine-grained simulation of robotic manipulation in imagination space, significantly improving the efficiency and scalability of policy evaluation and optimization.

📝 Abstract
Generalist robot policies can now perform a wide range of manipulation skills, but evaluating and improving their ability to handle unfamiliar objects and instructions remains a significant challenge. Rigorous evaluation requires a large number of real-world rollouts, while systematic improvement demands additional corrective data with expert labels; both processes are slow, costly, and difficult to scale. World models offer a promising, scalable alternative by enabling policies to roll out within an imagination space. However, a key challenge is building a controllable world model that can handle multi-step interactions with generalist robot policies. Such a model must be compatible with modern generalist policies, supporting multi-view prediction, fine-grained action control, and consistent long-horizon interactions, requirements that previous work has not met. In this paper, we take a step forward by introducing a controllable multi-view world model that can be used to evaluate and improve the instruction-following ability of generalist robot policies. Our model maintains long-horizon consistency through a pose-conditioned memory retrieval mechanism and achieves precise action control through frame-level action conditioning. Trained on the DROID dataset (95k trajectories, 564 scenes), it generates spatially and temporally consistent trajectories in novel scenes and under new camera placements for over 20 seconds. We show that our method can accurately rank policy performance without real-world robot rollouts. Moreover, by synthesizing successful trajectories in imagination and using them for supervised fine-tuning, our approach improves policy success rates by 44.7%.
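The closed-loop idea in the abstract, a policy acting on observations that the world model itself predicts, can be sketched as follows. The `ToyWorldModel`, `imagined_rollout`, and their interfaces are illustrative assumptions, not the paper's actual API:

```python
from dataclasses import dataclass


@dataclass
class ToyWorldModel:
    """Stand-in for a learned action-conditioned video world model."""

    def step(self, obs, action):
        # A real model would predict the next multi-view frames from
        # (observation, action); here we just advance a scalar state.
        return obs + action


def imagined_rollout(world_model, policy, init_obs, steps):
    """Roll a policy out entirely inside the world model ("imagination"),
    so no physical robot execution is needed."""
    obs, trajectory = init_obs, []
    for _ in range(steps):
        action = policy(obs)               # policy acts on predicted observations
        obs = world_model.step(obs, action)
        trajectory.append((action, obs))
    return trajectory
```

In this loop, imagined trajectories can be scored to rank policies, and the successful ones reused as fine-tuning data, which is the evaluate-then-improve workflow the abstract describes.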
Problem

Research questions and friction points this paper is trying to address.

Evaluating generalist robot policies with unfamiliar objects and instructions
Building controllable world models for multi-step robot interactions
Reducing costly real-world rollouts for robot policy improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Controllable multi-view world model for robot policies
Pose-conditioned memory retrieval for long-horizon consistency
Frame-level action conditioning enables precise control
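One plausible reading of the pose-conditioned memory retrieval above is nearest-neighbor lookup in pose space: past frames whose recorded poses are closest to the current query pose are retrieved to condition generation, keeping long-horizon predictions consistent. This toy sketch assumes simple vector poses and L2 distance; the paper's actual retrieval mechanism may differ:

```python
import numpy as np


def retrieve_by_pose(memory_poses, memory_frames, query_pose, k=3):
    """Toy pose-conditioned retrieval: return the k stored frames whose
    poses are nearest (L2 distance) to the query pose."""
    poses = np.asarray(memory_poses, dtype=float)
    dists = np.linalg.norm(poses - np.asarray(query_pose, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]        # indices of the k closest poses
    return [memory_frames[i] for i in nearest]
```

The retrieved frames would then be fed to the generator as extra conditioning alongside the frame-level actions.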