EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied World Models

📅 2025-05-14
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the lack of physically grounded, action-consistent evaluation criteria for Embodied World Models (EWMs) by introducing EWMBench, the first dedicated benchmark for EWM assessment. Methodologically, it defines and quantifies three orthogonal quality dimensions (visual scene consistency, motion plausibility, and semantic alignment) and builds a multi-dimensional evaluation framework that integrates geometric reasoning, trajectory analysis, cross-modal alignment, and differentiable physics simulation. Experiments reveal critical deficiencies in state-of-the-art EWMs, including a 42% drop in motion coherence and a 57% decline in instruction-action semantic fidelity. The open-sourced dataset and toolchain are intended to support community-driven improvements. EWMBench establishes a physically grounded, action-executable paradigm for evaluating embodied intelligence.
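The three quality dimensions and their aggregation can be illustrated with a toy scoring sketch. This is not the EWMBench implementation: the concrete metric definitions here (frame-difference scene consistency, jerk-penalized motion plausibility, cosine-similarity semantic alignment) and the equal weighting are illustrative assumptions, not the paper's formulas.

```python
import numpy as np

def scene_consistency(frames):
    # frames: (T, H, W) array of grayscale frames in [0, 1].
    # Score 1.0 when consecutive frames are identical; lower as they diverge.
    diffs = np.abs(np.diff(frames, axis=0))
    return float(1.0 - diffs.mean())

def motion_plausibility(trajectory):
    # trajectory: (T, D) array of end-effector positions, T >= 4.
    # Penalize jerk (third-order finite difference); smooth motion scores 1.0.
    jerk = np.diff(trajectory, n=3, axis=0)
    return float(1.0 / (1.0 + np.abs(jerk).mean()))

def semantic_alignment(instr_emb, video_emb):
    # Cosine similarity between an instruction embedding and a video embedding.
    num = float(np.dot(instr_emb, video_emb))
    den = float(np.linalg.norm(instr_emb) * np.linalg.norm(video_emb))
    return num / den

def ewm_score(frames, trajectory, instr_emb, video_emb,
              weights=(1 / 3, 1 / 3, 1 / 3)):
    # Aggregate the three per-dimension scores into one weighted scalar.
    parts = (scene_consistency(frames),
             motion_plausibility(trajectory),
             semantic_alignment(instr_emb, video_emb))
    return float(np.dot(weights, parts)), parts
```

A static scene, a linear (jerk-free) trajectory, and matching embeddings each score 1.0, so the aggregate is 1.0; any degradation along one dimension lowers only its own component, which is what makes the dimensions separable for diagnosis.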

πŸ“ Abstract
Recent advances in creative AI have enabled the synthesis of high-fidelity images and videos conditioned on language instructions. Building on these developments, text-to-video diffusion models have evolved into embodied world models (EWMs) capable of generating physically plausible scenes from language commands, effectively bridging vision and action in embodied AI applications. This work addresses the critical challenge of evaluating EWMs beyond general perceptual metrics to ensure the generation of physically grounded and action-consistent behaviors. We propose the Embodied World Model Benchmark (EWMBench), a dedicated framework designed to evaluate EWMs based on three key aspects: visual scene consistency, motion correctness, and semantic alignment. Our approach leverages a meticulously curated dataset encompassing diverse scenes and motion patterns, alongside a comprehensive multi-dimensional evaluation toolkit, to assess and compare candidate models. The proposed benchmark not only identifies the limitations of existing video generation models in meeting the unique requirements of embodied tasks but also provides valuable insights to guide future advancements in the field. The dataset and evaluation tools are publicly available at https://github.com/AgibotTech/EWMBench.
Problem

Research questions and friction points this paper is trying to address.

Evaluating embodied world models for physical plausibility
Assessing visual scene and motion consistency in generated videos
Ensuring semantic alignment between language instructions and generated actions
Innovation

Methods, ideas, or system contributions that make the work stand out.

EWMBench, a dedicated benchmark for evaluating embodied world models
Multi-dimensional evaluation toolkit covering visual, motion, and semantic quality
Curated dataset spanning diverse scenes and motion patterns