🤖 AI Summary
This work addresses key limitations of Behavior Cloning, namely poor task transferability and dependence on large sets of carefully collected expert demonstrations, by proposing a vision-based world-model framework for autonomous planning. Methodologically, it introduces an action-conditioned visual world model trained on a few hours of unstructured play data to capture environment dynamics; a diffusion-based action sampler that mitigates world-model hallucinations during multi-step prediction; and a combination of a Monte Carlo Tree Search (MCTS) planner with a zeroth-order model predictive controller (MPC) for long-horizon action optimization and execution. An optional reward model can be incorporated to strengthen planning. Evaluated on three real-robot manipulation tasks of varying planning and modeling complexity, the framework significantly outperforms Behavior Cloning baselines in success rate, supporting data-efficient, generalizable robotic planning.
📝 Abstract
Robots must understand their environment from raw sensory inputs and reason about the consequences of their actions in it to solve complex tasks. Behavior Cloning (BC) leverages task-specific human demonstrations to learn this knowledge as end-to-end policies. However, these policies are difficult to transfer to new tasks, and generating training data is challenging because it requires careful demonstrations and frequent environment resets. In contrast to this policy-based view, in this paper we take a model-based approach: we collect a few hours of unstructured, easy-to-collect play data to learn an action-conditioned visual world model, a diffusion-based action sampler, and optionally a reward model. The world model -- in combination with the action sampler and a reward model -- is then used to optimize long sequences of actions with a Monte Carlo Tree Search (MCTS) planner. The resulting plans are executed on the robot via a zeroth-order Model Predictive Controller (MPC). We show that the action sampler mitigates hallucinations of the world model during planning, and we validate our approach on three real-world robotic tasks with varying levels of planning and modeling complexity. Our experiments support the hypothesis that planning leads to a significant improvement over BC baselines on a standard manipulation test environment.
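The execution side of the abstract (optimize an action sequence against the world model, execute the first action, replan) can be sketched as a zeroth-order, sampling-based MPC loop. This is a toy illustration under assumed 1-D dynamics; `world_model`, `action_sampler`, and `reward_model` are hypothetical stand-ins for the paper's learned models:

```python
import random

GOAL = 5.0

def world_model(state, action):
    # Assumed toy dynamics: the action directly shifts the state.
    return state + action

def action_sampler(horizon):
    # Stand-in for the diffusion-based sampler: draws a candidate
    # action sequence uniformly from [-1, 1].
    return [random.uniform(-1.0, 1.0) for _ in range(horizon)]

def reward_model(state):
    # Higher reward the closer the predicted state is to the goal.
    return -abs(state - GOAL)

def zeroth_order_mpc(state, horizon=5, num_samples=256):
    # Zeroth-order (gradient-free) planning: sample candidate action
    # sequences, roll each out through the world model, keep the best,
    # and return only its first action (receding horizon).
    best_first, best_return = 0.0, float("-inf")
    for _ in range(num_samples):
        actions = action_sampler(horizon)
        s, total = state, 0.0
        for a in actions:
            s = world_model(s, a)
            total += reward_model(s)
        if total > best_return:
            best_return, best_first = total, actions[0]
    return best_first

# Receding-horizon execution: replan after every executed action.
random.seed(0)
state = 0.0
for _ in range(10):
    state = world_model(state, zeroth_order_mpc(state))
print(f"final state: {state:.2f}")  # should settle near GOAL = 5.0
```

Because only imagined rollouts are scored, a world model that hallucinates favorable futures would mislead this loop; constraining the sampled sequences to plausible actions, as the diffusion sampler does in the paper, keeps the rollouts on-distribution.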