🤖 AI Summary
To address the challenges of time-consuming, non-scalable, and unsafe evaluation of robotic manipulation policies on real hardware, this paper proposes WorldEval, a world-model-based online evaluation framework. Methodologically, it introduces (1) Policy2Vec—a technique that repurposes video generation models into action-controllable world simulators—and (2) the first end-to-end evaluation pipeline supporting automatic policy ranking and real-time hazardous action interception. The framework integrates latent-space action encoding, real-to-sim performance correlation modeling, and a lightweight online evaluation mechanism. Evaluated on real-world grasping tasks, WorldEval achieves strong correlation with physical deployment performance (Spearman ρ > 0.92), accelerates evaluation by 20× compared to real-hardware testing, and significantly outperforms real-to-sim baselines. Moreover, it successfully detects and blocks multiple categories of high-risk manipulations, enhancing safety and reliability in policy assessment.
📝 Abstract
The field of robotics has made significant strides toward developing generalist robot manipulation policies. However, evaluating these policies in real-world scenarios remains time-consuming and challenging, particularly as the number of tasks scales and environmental conditions change. In this work, we demonstrate that world models can serve as a scalable, reproducible, and reliable proxy for real-world robot policy evaluation. A key challenge is generating accurate policy videos from world models that faithfully reflect the robot actions. We observe that directly inputting robot actions or using high-dimensional encoding methods often fails to generate action-following videos. To address this, we propose Policy2Vec, a simple yet effective approach to turn a video generation model into a world simulator that follows latent action to generate the robot video. We then introduce WorldEval, an automated pipeline designed to evaluate real-world robot policies entirely online. WorldEval effectively ranks various robot policies and individual checkpoints within a single policy, and functions as a safety detector to prevent dangerous actions by newly developed robot models. Through comprehensive paired evaluations of manipulation policies in real-world environments, we demonstrate a strong correlation between policy performance in WorldEval and real-world scenarios. Furthermore, our method significantly outperforms popular methods such as real-to-sim approach.