Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training

📅 2026-04-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high cost and limited scalability of existing robot policy post-training methods that rely on human intervention in the real world. The authors propose Hi-WM, a framework that uses an action-conditioned world model as a reusable, interactive error-correction platform. Inside the world model, humans can roll back to failure states, branch into multiple interventions, and generate corrective trajectories for policy fine-tuning. The approach supports closed-loop rollouts, state caching, and trajectory replay. Evaluated on three real-world manipulation tasks, Hi-WM improves success rates by an average of 37.9 percentage points over the base policy and by 19.0 points over a world-model closed-loop baseline, and world-model-based evaluation correlates strongly (r = 0.953) with real-world performance.

📝 Abstract

Post-training is essential for turning pretrained generalist robot policies into reliable task-specific controllers, but existing human-in-the-loop pipelines remain tied to physical execution: each correction requires robot time, scene setup, resets, and operator supervision in the real world. Meanwhile, action-conditioned world models have been studied mainly for imagination, synthetic data generation, and policy evaluation. We propose Human-in-the-World-Model (Hi-WM), a post-training framework that uses a learned world model as a reusable corrective substrate for failure-targeted policy improvement. A policy is first rolled out in closed loop inside the world model; when the rollout becomes incorrect or failure-prone, a human intervenes directly in the model to provide short corrective actions. Hi-WM caches intermediate states and supports rollback and branching, allowing a single failure state to be reused for multiple corrective continuations and yielding dense supervision around behaviors that the base policy handles poorly. The resulting corrective trajectories are then added back to the training set for post-training. We evaluate Hi-WM on three real-world manipulation tasks spanning both rigid and deformable object interaction, and on two policy backbones. Hi-WM improves real-world success by 37.9 points on average over the base policy and by 19.0 points over a world-model closed-loop baseline, while world-model evaluation correlates strongly with real-world performance (r = 0.953). These results suggest that world models can serve not only as generators or evaluators, but also as effective corrective substrates for scalable robot post-training.
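
The loop described in the abstract (closed-loop rollout, state caching, rollback, and branching corrections) can be illustrated with a toy sketch. This is not the authors' implementation: the 1-D state, the additive `world_model_step`, the function names `rollout` and `branch_corrections`, and all numbers are hypothetical stand-ins chosen only to show how one cached failure state can seed several corrective continuations.

```python
from typing import Callable, List, Tuple

State = float   # toy 1-D state (assumption); a real world model would hold images or latents
Action = float
Pair = Tuple[State, Action]

def world_model_step(state: State, action: Action) -> State:
    """Hypothetical stand-in for a learned action-conditioned world model."""
    return state + action

def rollout(policy: Callable[[State], Action], start: State, horizon: int) -> List[State]:
    """Roll the policy out in closed loop inside the model, caching every state."""
    cache = [start]
    for _ in range(horizon):
        cache.append(world_model_step(cache[-1], policy(cache[-1])))
    return cache

def branch_corrections(cache: List[State], t: int,
                       corrections: List[List[Action]]) -> List[List[Pair]]:
    """Roll back to cached step t and replay each human action sequence as its own branch."""
    trajectories = []
    for actions in corrections:
        state = cache[t]            # rollback: every branch restarts from the same cached state
        branch = []
        for action in actions:
            branch.append((state, action))
            state = world_model_step(state, action)
        trajectories.append(branch)
    return trajectories

# Toy base policy that always drifts by +1.0 (made-up behaviour).
cache = rollout(lambda s: 1.0, start=0.0, horizon=4)

# Suppose a human flags step 2 as failure-prone and supplies two corrective continuations.
branches = branch_corrections(cache, t=2, corrections=[[-1.0, -1.0], [-2.0]])

# Flatten the branches into (state, action) pairs to append to the post-training set.
corrective_data = [pair for branch in branches for pair in branch]
```

Because the cache is never mutated, the same failure state can be revisited for any number of additional branches, which is the property the abstract credits for dense supervision around poorly handled behaviors.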

Problem

Research questions and friction points this paper is trying to address.

robot post-training

human-in-the-loop

world model

policy correction

scalable learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

world model

human-in-the-loop

robot post-training