dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model

📅 2026-04-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches struggle to efficiently evaluate robotic policies across thousands of environments and tasks. This work proposes a scalable evaluation agent based on a discrete diffusion world model that unifies vision, language, and action into a multimodal token sequence and jointly predicts future observations and task progress through a single Transformer-based denoising network. Key innovations include the construction of a unified multimodal token space, the introduction of a sparse keyframe memory mechanism to enhance temporal modeling efficiency, and the design of a task-progress token enabling end-to-end automatic success determination. Evaluated on LIBERO, RoboTwin, and multiple real-world robotic tasks, the method significantly outperforms baseline approaches such as WorldEval, Ctrl-World, and WorldGym.
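
One common way to put several modalities into a single token space is to give each modality its own disjoint range of ids inside one shared vocabulary. The sketch below illustrates that idea only; the summary does not say how dWorldEval builds its token space, and the modality names, vocabulary sizes, and helper functions here are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Modality:
    name: str
    vocab_size: int


def build_offsets(modalities):
    """Assign each modality a disjoint id range in one shared vocabulary."""
    offsets, start = {}, 0
    for m in modalities:
        offsets[m.name] = start
        start += m.vocab_size
    return offsets, start  # start now equals the total vocabulary size


def to_unified(name, local_ids, offsets):
    """Shift modality-local token ids into the shared id space."""
    return [offsets[name] + i for i in local_ids]


# Hypothetical vocabulary sizes, chosen only for illustration.
MODALITIES = [
    Modality("text", 1000),
    Modality("vision", 512),
    Modality("action", 256),
]
OFFSETS, TOTAL_VOCAB = build_offsets(MODALITIES)

# One multimodal sequence: instruction tokens, image tokens, action tokens.
sequence = (
    to_unified("text", [5, 17], OFFSETS)
    + to_unified("vision", [0, 511], OFFSETS)
    + to_unified("action", [3], OFFSETS)
)
print(sequence, TOTAL_VOCAB)
```

With disjoint ranges, a single embedding table and a single output head can cover every modality, which is what lets one network denoise the whole sequence.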

📝 Abstract

Evaluating robotics policies across thousands of environments and thousands of tasks is infeasible with existing approaches, which motivates a new methodology for scalable robotics policy evaluation. In this paper, we propose dWorldEval, which uses a discrete diffusion world model as a scalable evaluation proxy for robotics policies. Specifically, dWorldEval maps all modalities, including vision, language, and robotic actions, into a unified token space and models them with a single transformer-based denoising network. Building on this architecture, we employ a sparse keyframe memory to maintain spatiotemporal consistency. We also introduce a progress token that indicates the degree of task completion. At inference, the model jointly predicts future observations and the progress token, so success can be determined automatically when the progress reaches 1. Extensive experiments demonstrate that dWorldEval significantly outperforms previous approaches, namely WorldEval, Ctrl-World, and WorldGym, on LIBERO, RoboTwin, and multiple real-robot tasks. It paves the way for a new architectural paradigm in building world simulators for robotics evaluation at scale.
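
The evaluation loop implied by the abstract can be sketched as follows: the policy proposes an action, the world model predicts the next observation together with a progress value, and the episode counts as a success once progress reaches 1. This is a minimal sketch under stated assumptions, not the paper's implementation: `evaluate_in_world_model`, `toy_policy`, and `toy_world_model` are hypothetical stand-ins, and progress is assumed to be a scalar in [0, 1].

```python
from typing import Callable, Tuple

Observation = Tuple[float, ...]
Action = float


def evaluate_in_world_model(
    policy: Callable[[Observation], Action],
    world_model: Callable[[Observation, Action], Tuple[Observation, float]],
    initial_obs: Observation,
    max_steps: int = 50,
    success_threshold: float = 1.0,
) -> Tuple[bool, int]:
    """Roll a policy out inside a world model; return (success, steps used)."""
    obs = initial_obs
    for step in range(1, max_steps + 1):
        action = policy(obs)
        # The world model jointly returns the next observation and progress.
        obs, progress = world_model(obs, action)
        if progress >= success_threshold:
            return True, step
    return False, max_steps


# Toy stand-ins (assumptions for illustration, not learned models).
def toy_policy(obs: Observation) -> Action:
    return 0.25  # always move a quarter of the way toward the goal


def toy_world_model(obs: Observation, action: Action):
    position = min(1.0, obs[0] + action)
    return (position,), position  # progress equals the position here


success, steps = evaluate_in_world_model(toy_policy, toy_world_model, (0.0,))
print(success, steps)
```

Because success is read directly from the predicted progress value, no separate success classifier or human labeling step is needed in this loop.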

Problem

Research questions and friction points this paper is trying to address.

robotics policy evaluation

scalable evaluation

world model

discrete diffusion

task success assessment

Innovation

Methods, ideas, or system contributions that make the work stand out.

discrete diffusion

world model

robotic policy evaluation