EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied World Models

📅 2025-05-14
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the lack of physically grounded, action-consistent evaluation criteria for Embodied World Models (EWMs) by introducing EWMBench, the first dedicated benchmark for EWM assessment. Methodologically, it defines and quantifies three orthogonal quality dimensions (visual scene consistency, motion plausibility, and semantic alignment) and builds a multi-dimensional evaluation framework that integrates geometric reasoning, trajectory analysis, cross-modal alignment, and differentiable physics simulation. Experiments reveal critical deficiencies in state-of-the-art EWMs, including a 42% drop in motion coherence and a 57% decline in instruction-action semantic fidelity. The open-sourced dataset and toolchain are intended to support community-driven improvements. EWMBench establishes a physically grounded, action-executable paradigm for evaluating embodied intelligence.
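The three quality dimensions and their aggregation can be illustrated with a toy scoring sketch. This is not the EWMBench implementation: the concrete metric definitions here (frame-difference scene consistency, jerk-penalized motion plausibility, cosine-similarity semantic alignment) and the equal weighting are illustrative assumptions, not the paper's formulas.

```python
import numpy as np

def scene_consistency(frames):
    # frames: (T, H, W) array of grayscale frames in [0, 1].
    # Score 1.0 when consecutive frames are identical; lower as they diverge.
    diffs = np.abs(np.diff(frames, axis=0))
    return float(1.0 - diffs.mean())

def motion_plausibility(trajectory):
    # trajectory: (T, D) array of end-effector positions, T >= 4.
    # Penalize jerk (third-order finite difference); smooth motion scores 1.0.
    jerk = np.diff(trajectory, n=3, axis=0)
    return float(1.0 / (1.0 + np.abs(jerk).mean()))

def semantic_alignment(instr_emb, video_emb):
    # Cosine similarity between an instruction embedding and a video embedding.
    num = float(np.dot(instr_emb, video_emb))
    den = float(np.linalg.norm(instr_emb) * np.linalg.norm(video_emb))
    return num / den

def ewm_score(frames, trajectory, instr_emb, video_emb,
              weights=(1 / 3, 1 / 3, 1 / 3)):
    # Aggregate the three per-dimension scores into one weighted scalar.
    parts = (scene_consistency(frames),
             motion_plausibility(trajectory),
             semantic_alignment(instr_emb, video_emb))
    return float(np.dot(weights, parts)), parts
```

A static scene, a linear (jerk-free) trajectory, and matching embeddings each score 1.0, so the aggregate is 1.0; any degradation along one dimension lowers only its own component, which is what makes the dimensions separable for diagnosis.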

πŸ“ Abstract
Recent advances in creative AI have enabled the synthesis of high-fidelity images and videos conditioned on language instructions. Building on these developments, text-to-video diffusion models have evolved into embodied world models (EWMs) capable of generating physically plausible scenes from language commands, effectively bridging vision and action in embodied AI applications. This work addresses the critical challenge of evaluating EWMs beyond general perceptual metrics to ensure the generation of physically grounded and action-consistent behaviors. We propose the Embodied World Model Benchmark (EWMBench), a dedicated framework designed to evaluate EWMs based on three key aspects: visual scene consistency, motion correctness, and semantic alignment. Our approach leverages a meticulously curated dataset encompassing diverse scenes and motion patterns, alongside a comprehensive multi-dimensional evaluation toolkit, to assess and compare candidate models. The proposed benchmark not only identifies the limitations of existing video generation models in meeting the unique requirements of embodied tasks but also provides valuable insights to guide future advancements in the field. The dataset and evaluation tools are publicly available at https://github.com/AgibotTech/EWMBench.
Problem

Research questions and friction points this paper is trying to address.

Evaluating embodied world models for physical plausibility
Assessing visual scene and motion consistency in generated videos
Ensuring semantic alignment between language instructions and generated actions
Innovation

Methods, ideas, or system contributions that make the work stand out.

EWMBench, a dedicated benchmark for evaluating embodied world models
Multi-dimensional evaluation toolkit covering visual, motion, and semantic quality
Curated dataset spanning diverse scenes and motion patterns