WorldArena 2.0: Extending Embodied World Model Benchmarking on Modality, Functionality and Platform

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks for embodied world models are largely confined to purely visual, offline, and simulated settings, limiting their ability to comprehensively evaluate complex embodied intelligent systems. This work proposes a novel evaluation benchmark that systematically extends assessment capabilities across three dimensions: modality (integrating vision and touch), functionality (supporting interactive policy optimization), and platform (spanning both simulation and real robots). Built upon a standardized protocol, the benchmark unifies multimodal perception modeling, action-conditioned future prediction, and cross-platform deployment. It enables, for the first time, a unified and scalable evaluation of world models in terms of perceptual fidelity, interactive utility, and cross-platform performance, thereby offering a comprehensive testing framework for embodied intelligence research.

📝 Abstract

World models have emerged as a central paradigm for embodied intelligence, enabling agents to predict action-conditioned futures and reason about environmental dynamics. However, existing embodied world model benchmarks are still largely confined to vision-only prediction, offline embodied applications, and simulator-based evaluation, making them insufficient for assessing increasingly comprehensive world models. In this work, we introduce WorldArena 2.0, an expanded benchmark that systematically broadens embodied world model evaluation along three dimensions: modality, functionality, and platform. Along the modality dimension, WorldArena 2.0 extends evaluation from vision-only to visuotactile modalities, enabling assessment of multimodal perception and prediction. Along the functionality dimension, it extends beyond policy evaluation and planning to assess world models as interactive RL environments for policy optimization. Along the platform dimension, it moves beyond simulator-only evaluation to a diverse suite of simulated and real-world robotic settings across multiple embodiments. Under a standardized protocol, WorldArena 2.0 evaluates perceptual quality, interactive utility, and cross-platform performance, providing a comprehensive testbed for tracking progress toward embodied world models. The benchmark is available at: https://world-arena.ai.
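To make the functionality dimension concrete, here is a minimal sketch of what "a world model as an interactive RL environment" can mean in practice. It is an illustration under stated assumptions, not the WorldArena 2.0 protocol or API: `ToyWorldModel`, `WorldModelEnv`, `rollout_return`, and `optimize_gain` are hypothetical names, the dynamics are a hand-written 1-D rule standing in for a learned model, and the "policy optimization" is a simple search over one policy parameter rather than a real RL algorithm.

```python
class ToyWorldModel:
    """Hypothetical stand-in for a learned world model.

    A real model would predict future observations (e.g. video or touch)
    from past observations and actions; here it is a fixed 1-D rule.
    """

    def predict(self, state: float, action: float) -> float:
        return state + action


class WorldModelEnv:
    """Wraps a world model behind a reset/step interface so that a policy
    can be rolled out and improved entirely inside the model."""

    def __init__(self, model, goal: float = 1.0, horizon: int = 10):
        self.model = model
        self.goal = goal
        self.horizon = horizon
        self.state = 0.0
        self.t = 0

    def reset(self) -> float:
        self.state = 0.0
        self.t = 0
        return self.state

    def step(self, action: float):
        # Clip the action to an assumed valid range of [-0.5, 0.5].
        action = max(-0.5, min(0.5, action))
        # The next state comes from the model's prediction, not a simulator.
        self.state = self.model.predict(self.state, action)
        self.t += 1
        reward = -abs(self.goal - self.state)  # closer to the goal is better
        done = self.t >= self.horizon
        return self.state, reward, done


def rollout_return(env: WorldModelEnv, gain: float) -> float:
    """Total reward of a proportional policy a = gain * (goal - state)."""
    state = env.reset()
    total, done = 0.0, False
    while not done:
        state, reward, done = env.step(gain * (env.goal - state))
        total += reward
    return total


def optimize_gain(env: WorldModelEnv, candidates):
    """Pick the policy parameter with the highest imagined return."""
    return max(candidates, key=lambda g: rollout_return(env, g))


env = WorldModelEnv(ToyWorldModel())
best_gain = optimize_gain(env, [0.0, 0.1, 0.25, 0.5])
```

In this toy setting the search selects the largest candidate gain, since it closes the distance to the goal fastest. Whether a policy tuned this way also works outside the model depends on how faithful the model's predictions are, which is the property an interactive-utility evaluation would need to probe.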

Problem

Research questions and friction points this paper is trying to address.

world models

embodied intelligence

benchmarking

multimodal perception

real-world robotics

Innovation

Methods, ideas, or system contributions that make the work stand out.

embodied world models

multimodal perception

interactive RL environments

cross-platform evaluation

visuotactile modalities

🔎 Similar Papers

Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI

2024-07-09 · IEEE/ASME Transactions on Mechatronics · Citations: 94

Elements of World Knowledge (EWOK): A cognition-inspired framework for evaluating basic world knowledge in language models

2024-05-15 · arXiv.org · Citations: 28