RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

📅 2026-04-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video world models can generate visually realistic sequences, but their outputs are rarely evaluated for physical plausibility or for whether the depicted behavior can actually be executed, which limits their use in real-world robotics. To address this gap, the paper introduces RoboWM-Bench, a manipulation-centric benchmark for embodiment-grounded evaluation of video world models. It converts behaviors generated from human-hand and robotic manipulation videos into embodied action sequences and validates them through robotic execution, so that task effectiveness is assessed separately from visual fidelity under a unified, reproducible protocol. Experiments on state-of-the-art models reveal errors in spatial reasoning, unstable contact prediction, and non-physical deformations. Fine-tuning on manipulation data yields improvements, but physical inconsistencies persist, pointing to opportunities for more physically grounded video generation for robots.

📝 Abstract

Recent advances in large-scale video world models have enabled increasingly realistic future prediction, raising the prospect of leveraging imagined videos for robot learning. However, visual realism does not imply physical plausibility, and behaviors inferred from generated videos may violate dynamics and fail when executed by embodied agents. Existing benchmarks begin to incorporate notions of physical plausibility, but they largely remain perception- or diagnostic-oriented and do not systematically evaluate whether predicted behaviors can be translated into executable actions that complete the intended task. To address this gap, we introduce RoboWM-Bench, a manipulation-centric benchmark for embodiment-grounded evaluation of video world models. RoboWM-Bench converts generated behaviors from both human-hand and robotic manipulation videos into embodied action sequences and validates them through robotic execution. The benchmark spans diverse manipulation scenarios and establishes a unified protocol for consistent and reproducible evaluation. Using RoboWM-Bench, we evaluate state-of-the-art video world models and find that reliably generating physically executable behaviors remains an open challenge. Common failure modes include errors in spatial reasoning, unstable contact prediction, and non-physical deformations. While finetuning on manipulation data yields improvements, physical inconsistencies still persist, suggesting opportunities for more physically grounded video generation for robots.
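The evaluation flow described in the abstract (generated video, then embodied action sequence, then robotic execution, then a task-level verdict) can be pictured with a minimal sketch. The abstract does not specify the benchmark's interfaces or metrics, so everything here is an assumption made for illustration: the `Episode` type, the `video_to_actions` and `execute` callables, and the use of per-task success rate as the reported score are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Hypothetical action type, e.g. one end-effector command per step.
Action = Sequence[float]


@dataclass
class Episode:
    task: str      # hypothetical task label
    video: object  # placeholder for a generated video clip


def evaluate(
    episodes: List[Episode],
    video_to_actions: Callable[[object], List[Action]],
    execute: Callable[[str, List[Action]], bool],
) -> Dict[str, float]:
    """Return the per-task fraction of episodes whose execution succeeded."""
    attempts: Dict[str, int] = {}
    successes: Dict[str, int] = {}
    for ep in episodes:
        # Step 1: map the generated video to an action sequence (assumed interface).
        actions = video_to_actions(ep.video)
        # Step 2: run the actions on a robot or simulator and record success.
        ok = execute(ep.task, actions)
        attempts[ep.task] = attempts.get(ep.task, 0) + 1
        successes[ep.task] = successes.get(ep.task, 0) + int(ok)
    return {task: successes[task] / attempts[task] for task in attempts}


# Toy stand-ins so the sketch runs end to end; these are not real models or robots.
episodes = [
    Episode("pick", [0.1, 0.2]),
    Episode("pick", []),
    Episode("fold", [0.3]),
]
rates = evaluate(
    episodes,
    video_to_actions=lambda video: [[x] for x in video],
    execute=lambda task, actions: task == "pick" and len(actions) > 0,
)
print(rates)
```

The point of the sketch is that the score depends only on the outcome of execution, not on how realistic the video looks, which is the separation between visual realism and physical executability that the abstract argues for.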

Problem

Research questions and friction points this paper is trying to address.

world models

robotic manipulation

physical plausibility

embodied action

video generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

world models

robotic manipulation

embodiment-grounded evaluation