WorldEval: World Model as Real-World Robot Policies Evaluator

📅 2025-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Evaluating robotic manipulation policies on real hardware is time-consuming, hard to scale, and potentially unsafe. To address this, the paper proposes WorldEval, a world-model-based online evaluation framework. It introduces (1) Policy2Vec, a technique that repurposes video generation models into action-controllable world simulators, and (2) the first end-to-end evaluation pipeline supporting automatic policy ranking and real-time interception of hazardous actions. The framework integrates latent-space action encoding, real-to-sim performance correlation modeling, and a lightweight online evaluation mechanism. Evaluated on real-world grasping tasks, WorldEval achieves strong correlation with physical deployment performance (Spearman ρ > 0.92), accelerates evaluation by 20× compared to real-hardware testing, and significantly outperforms real-to-sim baselines. It also detects and blocks multiple categories of high-risk manipulations, which makes policy assessment safer and more reliable.

📝 Abstract

The field of robotics has made significant strides toward developing generalist robot manipulation policies. However, evaluating these policies in real-world scenarios remains time-consuming and challenging, particularly as the number of tasks scales and environmental conditions change. In this work, we demonstrate that world models can serve as a scalable, reproducible, and reliable proxy for real-world robot policy evaluation. A key challenge is generating accurate policy videos from world models that faithfully reflect the robot actions. We observe that directly inputting robot actions or using high-dimensional encoding methods often fails to generate action-following videos. To address this, we propose Policy2Vec, a simple yet effective approach that turns a video generation model into a world simulator that follows latent actions to generate robot videos. We then introduce WorldEval, an automated pipeline designed to evaluate real-world robot policies entirely online. WorldEval effectively ranks different robot policies, as well as individual checkpoints of a single policy, and functions as a safety detector that prevents dangerous actions by newly developed robot models. Through comprehensive paired evaluations of manipulation policies in real-world environments, we demonstrate a strong correlation between policy performance in WorldEval and in real-world scenarios. Furthermore, our method significantly outperforms popular methods such as the real-to-sim approach.
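The claimed agreement between world-model and real-world performance is a rank-correlation statement, so it can be made concrete with a small sketch. The code below is not the authors' pipeline. It assumes each policy has already been reduced to one success rate per setting, and the function names (`average_ranks`, `spearman`, `rank_policies`), the policy names, and all numbers are hypothetical placeholders chosen only for illustration.

```python
from typing import Dict, List


def average_ranks(values: List[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


def rank_policies(success_rate: Dict[str, float]) -> List[str]:
    """Order policy names from highest to lowest success rate."""
    return sorted(success_rate, key=lambda name: success_rate[name], reverse=True)


if __name__ == "__main__":
    # Hypothetical placeholder numbers, NOT results from the paper.
    world_model = {"policy_a": 0.80, "policy_b": 0.55, "policy_c": 0.30, "policy_d": 0.10}
    real_robot = {"policy_a": 0.70, "policy_b": 0.60, "policy_c": 0.20, "policy_d": 0.25}

    names = sorted(world_model)
    rho = spearman([world_model[n] for n in names], [real_robot[n] for n in names])
    print("world-model ranking:", rank_policies(world_model))
    print("real-robot ranking: ", rank_policies(real_robot))
    print("Spearman rho:", round(rho, 3))
```

A rank-based measure fits this use case because an evaluation proxy mainly needs to order policies and checkpoints correctly; it does not need to reproduce the absolute success rates measured on hardware.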

Problem

Research questions and friction points this paper is trying to address.

Evaluating real-world robot policies is time-consuming and challenging.

World models must generate policy videos that faithfully reflect robot actions; directly inputting actions or using high-dimensional encodings often fails to do so.

Robot policies need an automated pipeline for online evaluation and safety detection.

Innovation

Methods, ideas, or system contributions that make the work stand out.

World models as scalable robot policy evaluators

Policy2Vec for action-following video generation

WorldEval pipeline for online policy evaluation
