WorldArena 2.0: Extending Embodied World Model Benchmarking on Modality, Functionality and Platform

📅 2026-05-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks for embodied world models are largely confined to purely visual, offline, and simulated settings, limiting their ability to comprehensively evaluate complex embodied intelligent systems. This work proposes a novel evaluation benchmark that systematically extends assessment capabilities across three dimensions: modality (integrating vision and touch), functionality (supporting interactive policy optimization), and platform (spanning both simulation and real robots). Built upon a standardized protocol, the benchmark unifies multimodal perception modeling, action-conditioned future prediction, and cross-platform deployment. It enables, for the first time, a unified and scalable evaluation of world models in terms of perceptual fidelity, interactive utility, and cross-platform performance, thereby offering a comprehensive testing framework for embodied intelligence research.
📝 Abstract
World models have emerged as a central paradigm for embodied intelligence, enabling agents to predict action-conditioned futures and reason about environmental dynamics. However, existing embodied world model benchmarks are still largely confined to vision-only prediction, offline embodied applications, and simulator-based evaluation, making them insufficient for assessing increasingly comprehensive world models. In this work, we introduce WorldArena 2.0, an expanded benchmark that systematically broadens embodied world model evaluation along three dimensions: modality, functionality, and platform. Along the modality dimension, WorldArena 2.0 extends evaluation from vision-only to visuotactile modalities, enabling assessment of multimodal perception and prediction. Along the functionality dimension, it extends beyond policy evaluation and planning to assess world models as interactive RL environments for policy optimization. Along the platform dimension, it moves beyond simulator-only evaluation to a diverse suite of simulated and real-world robotic settings across multiple embodiments. Under a standardized protocol, WorldArena 2.0 evaluates perceptual quality, interactive utility, and cross-platform performance, providing a comprehensive testbed for tracking progress toward embodied world models. The benchmark is available at: https://world-arena.ai.
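To make the functionality dimension concrete, the following is a minimal sketch of what "a world model used as an interactive RL environment" can look like. It is not taken from WorldArena 2.0: the class names `ToyWorldModel` and `WorldModelEnv`, the two-number (vision, touch) state, the reward, and the grid search over a single policy gain are all hypothetical stand-ins chosen only to illustrate the loop of action-conditioned prediction, reward, and policy improvement.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Toy stand-in for a visuotactile observation: (visual feature, tactile feature).
State = Tuple[float, float]


@dataclass
class ToyWorldModel:
    """Hypothetical action-conditioned predictor: next_state = f(state, action)."""
    gain: float = 0.5

    def predict(self, state: State, action: float) -> State:
        vision, _ = state
        next_vision = vision + self.gain * action
        # Toy "contact" signal: largest when the visual feature reaches the target 1.0.
        next_touch = max(0.0, 1.0 - abs(1.0 - next_vision))
        return (next_vision, next_touch)


class WorldModelEnv:
    """Wraps the world model behind reset/step so a policy can interact with it."""

    def __init__(self, model: ToyWorldModel, horizon: int = 10):
        self.model = model
        self.horizon = horizon
        self.state: State = (0.0, 0.0)
        self.t = 0

    def reset(self) -> State:
        self.state = (0.0, 0.0)
        self.t = 0
        return self.state

    def step(self, action: float) -> Tuple[State, float, bool]:
        # The "environment transition" is just the model's own prediction.
        self.state = self.model.predict(self.state, action)
        self.t += 1
        reward = self.state[1]  # assumed reward: predicted tactile contact
        done = self.t >= self.horizon
        return self.state, reward, done


def rollout(env: WorldModelEnv, policy: Callable[[State], float]) -> float:
    """Run one episode inside the world model and return the summed reward."""
    state = env.reset()
    total, done = 0.0, False
    while not done:
        state, reward, done = env.step(policy(state))
        total += reward
    return total


def best_gain(env: WorldModelEnv, candidates: Sequence[float]) -> float:
    """Crude policy optimization: pick the proportional gain with the highest return."""
    def score(k: float) -> float:
        return rollout(env, lambda s: k * (1.0 - s[0]))
    return max(candidates, key=score)


if __name__ == "__main__":
    env = WorldModelEnv(ToyWorldModel())
    print(best_gain(env, [0.0, 0.5, 1.0, 2.0, 3.0]))
```

In this toy setting the policy never touches a simulator or a robot; every transition comes from the model's prediction, which is why prediction quality and usefulness for policy optimization can be assessed separately.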
Problem

Research questions and friction points this paper is trying to address.

world models
embodied intelligence
benchmarking
multimodal perception
real-world robotics
Innovation

Methods, ideas, or system contributions that make the work stand out.

embodied world models
multimodal perception
interactive RL environments
cross-platform evaluation
visuotactile modalities
Yu Shang
Department of Electronic Engineering, Tsinghua University
Multimodal Learning, LLM Agent, Recommender System
Yinzhou Tang
Tsinghua University
Yiding Ma
Tsinghua University
Zhuohang Li
Vanderbilt University
Lei Jin
Tsinghua University; School of Information Sciences, University of Pittsburgh
Security, Privacy, Access Control, Authentication, Social Network
Weikang Su
Tsinghua University
Xin Jin
Tsinghua University
Zhaolu Wang
Tsinghua University
Ziyou Wang
Tsinghua University
Xin Zhang
Tsinghua University, Manifold AI
LLM, MLLM, World Model, Embodied Intelligence
Haisheng Su
SenseTime; SenseAuto; SJTU
computer vision, video understanding, autonomous driving, embodied intelligence
Weizhen He
Zhejiang University
Wei Wu
Tsinghua University
Haoyi Duan
Stanford University
Gordon Wetzstein
Associate Professor of Electrical Engineering and Computer Science, Stanford University
Computational Imaging, Computational Displays, Computational Optics, Neural Rendering
Xihui Liu
University of Hong Kong, UC Berkeley, CUHK, Tsinghua University
Computer Vision, Deep Learning
Dhruv Shah
Princeton University, Google DeepMind
Robot Learning, Artificial Intelligence, Robotics, Reinforcement Learning
Zhaoxiang Zhang
Institute of Automation, Chinese Academy of Sciences
Computer Vision, Pattern Recognition, Biologically-inspired Learning
Zhibo Chen
Professor, University of Science and Technology of China
Generative AI, visual signal representation, video coding, video analysis and processing
Jun Zhu
Professor of Computer Science, Tsinghua University
Machine Learning, Bayesian Methods, Deep Generative Models, Adversarial Robustness, Reinforcement Learning
Yonghong Tian
Peking University
Tat-Seng Chua
National University of Singapore
Multimedia Information Retrieval, Live Social Media Analysis
Wenwu Zhu
Professor, Computer Science, Tsinghua University
Multimedia Computing, Network Representation Learning
Chen Gao
BNRist, Tsinghua University
Data Mining, LLM Agent, Embodied AI
Yong Li
Professor, Electronic Engineering, Tsinghua University
Urban Science, Data Mining, AI for Science