WorldModelBench: Judging Video Generation Models As World Models

📅 2025-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video generation benchmarks overemphasize visual fidelity while neglecting critical world-model capabilities such as physical consistency and instruction following, rendering them inadequate for decision-making applications like robotics and autonomous driving. To address this gap, the paper introduces WorldModelBench, a benchmark explicitly designed to evaluate world-model competencies in video generation, including an evaluation dimension grounded in detecting violations of physical conservation laws. The authors collect 67K human-annotated labels over 14 frontier models and use them to fine-tune a 2B-parameter automated judger whose average accuracy in predicting world-modeling violations is 8.6% higher than GPT-4o's. They further show that fine-tuning a generator to maximize the judger's reward noticeably improves the physical plausibility and instruction alignment of generated videos.
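
To make the evaluation flow concrete, here is a minimal sketch of how a fine-tuned judger could score a single generated video across such dimensions. The dimension names, the judger call signature, and the stub judger are illustrative assumptions, not the paper's released API.

```python
# Sketch of judger-based multi-dimensional video scoring (assumed interface).
from typing import Callable, Dict, List

# Hypothetical evaluation dimensions; the paper's exact taxonomy may differ.
DIMENSIONS = ["instruction_following", "physics_adherence", "commonsense"]

def score_video(
    frames: List["Image"],           # decoded frames of the generated video
    instruction: str,                # text prompt the generator was given
    judger: Callable[..., float],    # fine-tuned judger returning a score in [0, 1]
) -> Dict[str, float]:
    """Query the judger once per dimension and collect per-dimension scores."""
    return {dim: judger(frames=frames, instruction=instruction, dimension=dim)
            for dim in DIMENSIONS}

def aggregate(scores: Dict[str, float]) -> float:
    """Unweighted mean; the real benchmark may weight dimensions differently."""
    return sum(scores.values()) / len(scores)

if __name__ == "__main__":
    stub = lambda **kw: 0.5  # stand-in for the fine-tuned 2B judger
    print(aggregate(score_video(frames=[], instruction="pour water", judger=stub)))
```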

📝 Abstract
Video generation models have rapidly progressed, positioning themselves as video world models capable of supporting decision-making applications like robotics and autonomous driving. However, current benchmarks fail to rigorously evaluate these claims, focusing only on general video quality and ignoring factors important to world models, such as physics adherence. To bridge this gap, we propose WorldModelBench, a benchmark designed to evaluate the world modeling capabilities of video generation models in application-driven domains. WorldModelBench offers two key advantages: (1) Sensitive to nuanced world modeling violations: by incorporating instruction-following and physics-adherence dimensions, WorldModelBench detects subtle violations, such as irregular changes in object size that breach the mass conservation law, which prior benchmarks overlook. (2) Aligned with large-scale human preferences: we crowd-source 67K human labels to accurately measure 14 frontier models. Using our high-quality human labels, we further fine-tune an accurate judger to automate the evaluation procedure, achieving, with only 2B parameters, 8.6% higher average accuracy in predicting world modeling violations than GPT-4o. In addition, we demonstrate that training to align with human annotations by maximizing rewards from the judger noticeably improves world modeling capability. The website is available at https://worldmodelbench-team.github.io.
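
The abstract states that maximizing rewards from the judger improves world modeling, but does not pin down the algorithm. Below is a hedged REINFORCE-style sketch of what such reward maximization could look like, assuming a generator that exposes `sample` (returning videos with their log-probabilities, attached to the autograd graph) and a judger that exposes `score`; both interfaces are hypothetical.

```python
# Illustrative policy-gradient step using judger scores as rewards (assumed APIs).
import torch

def reinforce_step(generator, judger, prompts, optimizer):
    """Sample videos, score them with the judger, and reweight log-probs."""
    videos, log_probs = generator.sample(prompts)       # assumed sampling API
    with torch.no_grad():
        rewards = judger.score(videos, prompts)         # one scalar reward per video
    baseline = rewards.mean()                           # variance-reduction baseline
    loss = -((rewards - baseline) * log_probs).mean()   # ascend expected reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), rewards.mean().item()
```

The baseline subtraction is a standard variance-reduction choice, not something the paper specifies.
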
Problem

Research questions and friction points this paper is trying to address.

How to evaluate video generation models as world models rather than by visual quality alone
How to detect physics violations and instruction-following failures in generated videos (a toy violation check is sketched after this list)
How to scale evaluation beyond manual inspection using human-labeled data
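
As a toy illustration of the second question, the kind of mass-conservation proxy the abstract mentions (irregular changes in object size) can be flagged with a simple per-frame heuristic. The area-tracking setup and threshold below are assumptions; the benchmark itself relies on human labels and the learned judger rather than a hand-written rule.

```python
# Toy heuristic: flag frames where a tracked object's area jumps irregularly.
import numpy as np

def flag_size_violations(areas: np.ndarray, tol: float = 0.15) -> np.ndarray:
    """Return indices of frames whose object area changed by more than `tol`
    relative to the previous frame (a crude mass-conservation red flag)."""
    rel_change = np.abs(np.diff(areas) / areas[:-1])
    return np.where(rel_change > tol)[0] + 1

areas = np.array([100.0, 101.0, 99.0, 140.0, 139.0])  # object grows ~40% at frame 3
print(flag_size_violations(areas))  # -> [3]
```
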
Innovation

Methods, ideas, or system contributions that make the work stand out.

WorldModelBench rigorously evaluates 14 frontier video generation models as world models
Adds instruction-following and physics-adherence dimensions that surface subtle violations missed by prior benchmarks
Fine-tunes an accurate 2B-parameter judger on 67K crowd-sourced human labels to automate evaluation (a fine-tuning sketch follows this list)
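
A minimal sketch of the last point, assuming a frozen video-text backbone whose pooled features feed a small classification head trained on crowd-sourced violation labels; the feature dimension, binary label schema, and fake batch are placeholders for the real 67K-sample dataset.

```python
# One supervised fine-tuning step for a judger head on human violation labels.
import torch
import torch.nn as nn

class JudgerHead(nn.Module):
    """Linear classifier on top of frozen (assumed) video-text features."""
    def __init__(self, feat_dim: int = 512, n_classes: int = 2):
        super().__init__()
        self.fc = nn.Linear(feat_dim, n_classes)  # violation vs. no violation

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.fc(feats)

head = JudgerHead()
opt = torch.optim.AdamW(head.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

feats = torch.randn(8, 512)         # pooled backbone features (placeholder)
labels = torch.randint(0, 2, (8,))  # crowd-sourced violation labels (placeholder)
loss = loss_fn(head(feats), labels)
opt.zero_grad(); loss.backward(); opt.step()
```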