MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models

📅 2026-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the prevailing overemphasis on visual fidelity in evaluating robotic world models, which often neglects the reliability of action-conditioned predictions. To bridge this gap, we introduce MiraBench, the first hierarchical benchmark centered on action-conditioned reliability, which assesses world models at three levels: physics adherence, action-following fidelity, and optimism bias. Built on over 16,000 human-annotated judgments, MiraBench covers vector- and text-conditioned models, open-weight and closed-source systems, and a range of model scales, and it evaluates physical consistency without reference videos. Experiments across 12 representative model configurations show that visual fidelity is a poor proxy for action fidelity, that increasing model scale does not reliably improve action following, and that optimism bias (predicting success under failure-inducing actions) is pervasive, establishing a foundation for diagnosing and improving world models.
📝 Abstract
Action-conditioned world models are increasingly used as scalable simulators for robot learning, yet current evaluations provide limited evidence that their predictions are reliable under the actions they condition on. Existing benchmarks largely emphasize visual fidelity, leaving unclear whether predicted futures are physically plausible, faithful to commanded actions, and calibrated to failure when actions should not succeed. We introduce MiraBench, a hierarchical benchmark that defines action-conditioned reliability as a core evaluation target for robotic world models. MiraBench decomposes this target into three progressively demanding levels: Physics Adherence, which evaluates reference-free physical consistency; Action-Following Fidelity, which measures whether predictions respect task-relevant action inputs; and Optimism Bias Detection, which probes the tendency to predict successful outcomes under failure-inducing actions. To support this evaluation, we curate a human-annotated corpus with over 16,000 judgments across tasks, failure categories, and leading world models. We evaluate 12 representative model configurations spanning vector-conditioned robotic world models, text-conditioned generative world models, open-weight systems, closed-source systems, and multiple model scales. Across this broad model landscape, MiraBench reveals three central findings: visual fidelity is a poor proxy for action fidelity; increasing model scale does not reliably improve action following; and optimism bias is pervasive across current systems. By shifting evaluation from appearance to action-conditioned reliability, MiraBench provides a diagnostic foundation for assessing and improving robotic world models as faithful simulators.
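The three-level decomposition described in the abstract can be illustrated with a small aggregation sketch. This is not the benchmark's actual scoring code: the `Judgment` record, the level names, and the per-level pass-rate metric are all assumptions made for illustration. The abstract only states that human judgments are collected across the three levels. In this sketch, a judgment at the optimism level counts as passed when the model correctly shows failure under a failure-inducing action, so the optimism bias rate is taken as one minus that pass rate.

```python
from collections import defaultdict
from typing import NamedTuple


class Judgment(NamedTuple):
    """One human annotation (hypothetical schema, not from the paper)."""
    model: str
    level: str    # assumed labels: "physics", "action_following", "optimism"
    passed: bool  # annotator verdict for this prediction


def level_pass_rates(judgments):
    """Return {(model, level): fraction of judgments marked as passed}."""
    counts = defaultdict(lambda: [0, 0])  # (model, level) -> [passed, total]
    for j in judgments:
        entry = counts[(j.model, j.level)]
        entry[0] += int(j.passed)
        entry[1] += 1
    return {key: passed / total for key, (passed, total) in counts.items()}


def optimism_bias_rate(rates, model):
    """Assumed definition: share of failure-inducing cases predicted as success."""
    return 1.0 - rates[(model, "optimism")]


if __name__ == "__main__":
    demo = [
        Judgment("model_a", "physics", True),
        Judgment("model_a", "physics", False),
        Judgment("model_a", "optimism", False),
        Judgment("model_a", "optimism", True),
    ]
    rates = level_pass_rates(demo)
    print(rates)
    print(optimism_bias_rate(rates, "model_a"))
```

Reporting each level separately, rather than folding them into one score, mirrors the abstract's point that a model can look strong on one axis (for example, visual or physical plausibility) while still being unreliable on another.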
Problem

Research questions and friction points this paper is trying to address.

action-conditioned reliability
robotic world models
physics adherence
action-following fidelity
optimism bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

action-conditioned reliability
world models
robotic simulation
optimism bias
physics adherence
Tianzhuo Yang
Institute for Artificial Intelligence, Peking University
Zihan Shen
Institute for Artificial Intelligence, Peking University
Zirui Mi
Institute for Artificial Intelligence, Peking University
Zhaoyi Zhang
Institute for Artificial Intelligence, Peking University
Jiayi Zhou
Peking University Ph.D Student
AI
Jiaming Ji
Institute for Artificial Intelligence, Peking University, Physis Lab
Juntao Dai
Institute for Artificial Intelligence, Peking University, Physis Lab
Jiawei Chen
Institute for Artificial Intelligence, Peking University
Boyuan Chen
Peking University
AI Safety, Alignment, Scalable Oversight, Reasoning & MAS, Reinforcement Learning
Yaodong Yang
Boya (博雅) Assistant Professor at Peking University
Reinforcement Learning, AI Alignment, Embodied AI