🤖 AI Summary
Existing benchmarks for embodied world models are largely confined to vision-only prediction, offline applications, and simulator-based evaluation, which limits how thoroughly they can assess increasingly capable world models. This work introduces WorldArena 2.0, an expanded benchmark that broadens evaluation along three dimensions: modality (from vision-only to visuotactile), functionality (from policy evaluation and planning to use as interactive RL environments for policy optimization), and platform (from simulator-only evaluation to simulated and real-world robotic settings across multiple embodiments). Under a standardized protocol, the benchmark evaluates perceptual quality, interactive utility, and cross-platform performance, offering a unified testbed for embodied world model research.
📝 Abstract
World models have emerged as a central paradigm for embodied intelligence, enabling agents to predict action-conditioned futures and reason about environmental dynamics. However, existing embodied world model benchmarks are still largely confined to vision-only prediction, offline embodied applications, and simulator-based evaluation, making them insufficient for assessing increasingly comprehensive world models. In this work, we introduce WorldArena 2.0, an expanded benchmark that systematically broadens embodied world model evaluation along three dimensions: modality, functionality, and platform. Along the modality dimension, WorldArena 2.0 extends evaluation from vision-only to visuotactile modalities, enabling assessment of multimodal perception and prediction. Along the functionality dimension, it extends beyond policy evaluation and planning to assess world models as interactive RL environments for policy optimization. Along the platform dimension, it moves beyond simulator-only evaluation to a diverse suite of simulated and real-world robotic settings across multiple embodiments. Under a standardized protocol, WorldArena 2.0 evaluates perceptual quality, interactive utility, and cross-platform performance, providing a comprehensive testbed for tracking progress toward embodied world models. The benchmark is available at: https://world-arena.ai.
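To make the functionality dimension concrete, the following minimal Python sketch shows one way a world model could be exposed as an interactive RL environment with a reset/step loop. It is an illustration under stated assumptions, not the WorldArena 2.0 API: `WorldModelEnv`, `toy_predict`, `toy_reward`, and `rollout` are hypothetical names, and the "world model" here is a trivial additive dynamics function standing in for a learned action-conditioned predictor.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = List[float]
Action = List[float]


@dataclass
class WorldModelEnv:
    """Hypothetical wrapper: a one-step predictor used as an environment."""
    predict: Callable[[State, Action], State]  # stand-in for a learned world model
    reward: Callable[[State], float]           # task reward on the predicted state
    initial_state: State
    horizon: int = 10

    def reset(self) -> State:
        # Start a new imagined episode from the initial state.
        self._state = list(self.initial_state)
        self._t = 0
        return self._state

    def step(self, action: Action) -> Tuple[State, float, bool]:
        # The next state comes from the model's prediction, not a simulator.
        self._state = self.predict(self._state, action)
        self._t += 1
        done = self._t >= self.horizon
        return self._state, self.reward(self._state), done


def toy_predict(state: State, action: Action) -> State:
    # Assumed toy dynamics: next state = state + action.
    return [s + a for s, a in zip(state, action)]


def toy_reward(state: State) -> float:
    # Assumed toy task: move the first coordinate to 1.0.
    return -abs(state[0] - 1.0)


def rollout(env: WorldModelEnv, policy: Callable[[State], Action]) -> float:
    # Run one episode inside the world model and return the summed reward.
    state = env.reset()
    total, done = 0.0, False
    while not done:
        state, r, done = env.step(policy(state))
        total += r
    return total


env = WorldModelEnv(toy_predict, toy_reward, initial_state=[0.0], horizon=4)
moving_return = rollout(env, lambda s: [0.25])  # steps toward the goal
idle_return = rollout(env, lambda s: [0.0])     # never moves
```

In this toy setting the policy that moves toward the goal earns a higher return than the idle one, which is the kind of signal a policy optimizer would exploit when training inside a world model.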