WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of a unified, multidimensional evaluation framework for interactive video world models. It introduces WBench, a multi-turn benchmark comprising 289 test cases and 1,058 interaction turns that covers first- and third-person perspectives, four interaction types (navigation, subject action, event editing, and perspective switching), and three navigation input modalities: text, 6-DoF poses, and discrete actions. Models are assessed along five dimensions: video quality, setting adherence, interaction adherence, consistency, and physics compliance. Evaluation relies on 22 automatic sub-metrics that combine specialist vision models with large multimodal models, all validated against human judgments. Experiments on 20 state-of-the-art models show that no single model performs strongly across all dimensions, and the results provide fine-grained diagnostics of each model's strengths, weaknesses, and open challenges.

📝 Abstract

Interactive world models are advancing rapidly, yet existing benchmarks cover only part of the required competencies, leaving no unified standard for systematic evaluation. To fill this gap, we introduce WBench, a comprehensive multi-turn benchmark for interactive world model evaluation along five dimensions, namely video quality, setting adherence, interaction adherence, consistency, and physics compliance. WBench contains 289 test cases and 1,058 interaction turns, where each case specifies a world setting and a multi-turn interaction sequence, covering diverse scenes, styles, subjects, and both first- and third-person perspectives, together with four interaction types, including navigation, subject action, event editing, and perspective switching. For navigation, WBench unifies text, 6-DoF pose, and discrete-action control, enabling evaluation of models with different native input interfaces. Evaluation uses 22 automatic sub-metrics that combine specialist vision models with large multimodal models, and all metrics are validated against human judgments. Across 20 state-of-the-art models, we find that no single model performs strongly across all dimensions. We provide detailed diagnostic insights into the characteristic strengths, weaknesses, and open challenges of each model. Code and data are available at https://github.com/meituan-longcat/WBench.
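To make the case structure described in the abstract more concrete, the following is a minimal sketch of how one test case (a world setting plus a multi-turn interaction sequence) might be represented. It is not the schema used in the WBench repository: the class names, field names, the (x, y, z, roll, pitch, yaw) pose ordering, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class InteractionType(Enum):
    # The four interaction types named in the abstract.
    NAVIGATION = "navigation"
    SUBJECT_ACTION = "subject_action"
    EVENT_EDITING = "event_editing"
    PERSPECTIVE_SWITCHING = "perspective_switching"


class Perspective(Enum):
    FIRST_PERSON = "first_person"
    THIRD_PERSON = "third_person"


@dataclass
class Turn:
    """One interaction turn (hypothetical layout)."""
    interaction: InteractionType
    text: str  # natural-language instruction
    # Optional navigation controls for models whose native interface is not text.
    # Assumed ordering: (x, y, z, roll, pitch, yaw).
    pose_6dof: Optional[Tuple[float, float, float, float, float, float]] = None
    discrete_action: Optional[str] = None

    def available_controls(self) -> List[str]:
        """List the control modalities this turn provides."""
        controls = ["text"]
        if self.pose_6dof is not None:
            controls.append("pose_6dof")
        if self.discrete_action is not None:
            controls.append("discrete_action")
        return controls


@dataclass
class BenchmarkCase:
    """A world setting plus its multi-turn interaction sequence (hypothetical layout)."""
    case_id: str
    world_setting: str
    perspective: Perspective
    turns: List[Turn] = field(default_factory=list)


def total_turns(cases: List[BenchmarkCase]) -> int:
    """Count interaction turns across a list of cases."""
    return sum(len(case.turns) for case in cases)


# Illustrative case; the values are invented, not taken from the dataset.
example = BenchmarkCase(
    case_id="demo-001",
    world_setting="A rainy city street at night",
    perspective=Perspective.FIRST_PERSON,
    turns=[
        Turn(
            InteractionType.NAVIGATION,
            "Walk forward along the street",
            pose_6dof=(0.0, 0.0, 1.0, 0.0, 0.0, 0.0),
            discrete_action="forward",
        ),
        Turn(InteractionType.EVENT_EDITING, "Make the rain stop"),
    ],
)
```

Storing all three navigation controls on the same turn reflects the abstract's point that one instruction can be issued to models with different native input interfaces; a harness could select whichever entry of `available_controls()` a given model accepts.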

Problem

Research questions and friction points this paper is trying to address.

interactive world models

benchmark

multi-turn evaluation

video understanding

systematic evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

interactive world models

multi-turn benchmark

video generation evaluation