RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video generation models are primarily evaluated on perceptual quality and temporal coherence, overlooking their capacity for cognitive-level rule-based reasoning. To address this gap, we introduce RULER-Bench, the first systematic benchmark for evaluating rule-based reasoning in video generation, covering both text-to-video and image-to-video paradigms. It comprises 40 tasks across six rule categories and 622 annotated instances. We propose a fine-grained evaluation framework built on a four-metric checklist per generated video, leveraging GPT-o3 for automated scoring (85% alignment with human judgements) augmented by expert validation. Experimental results reveal that state-of-the-art models achieve only 48.87% on the rule coherence metric, exposing a critical reasoning bottleneck. RULER-Bench establishes an interpretable, reasoning-aware evaluation standard for next-generation vision foundation models, accompanied by open-source data and evaluation tools.

📝 Abstract
Recent advances in video generation have enabled the synthesis of videos with strong temporal consistency and impressive visual quality, marking a crucial step toward vision foundation models. To evaluate these video generation models, existing benchmarks primarily focus on factors related to visual perception and understanding, like visual aesthetics, instruction adherence, and temporal coherence. However, the rule-based reasoning capabilities of video generation models remain largely unexplored. Although recent studies have carried out preliminary explorations into whether video models can serve as zero-shot learners, they still lack a fine-grained decomposition of reasoning capabilities and a comprehensive evaluation protocol. To address this gap, we introduce RULER-Bench, a benchmark designed to evaluate the reasoning ability of video generation models from the perspective of cognitive rules. Built upon two fundamental paradigms: text-to-video and image-to-video, RULER-Bench covers 40 representative tasks spanning six rule categories with 622 high-quality annotated instances. For the evaluation of each generated video, we construct a checklist covering four metrics and leverage GPT-o3 to assign scores to each question, achieving 85% alignment with human judgements. Extensive experiments show that the state-of-the-art model achieves only 48.87% on the rule coherence metric, highlighting significant room for improvement in the reasoning capability of next-level video models. We expect that the insight obtained from RULER-Bench will facilitate further development of reasoning-aware video generation, advancing video generation models toward vision foundation intelligence.
Problem

Research questions and friction points this paper is trying to address.

Existing video generation benchmarks focus on perceptual quality, instruction adherence, and temporal coherence, leaving rule-based reasoning largely unexplored
Prior studies on video models as zero-shot learners lack a fine-grained decomposition of reasoning capabilities
No comprehensive evaluation protocol exists for assessing cognitive rules in generated videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces RULER-Bench, a benchmark for rule-based reasoning evaluation spanning text-to-video and image-to-video paradigms
Uses a four-metric checklist scored by GPT-o3, achieving 85% alignment with human judgements
Covers 40 representative tasks across six rule categories with 622 high-quality annotated instances
Authors
Xuming He — Zhejiang University
Zehao Fan — Zhejiang University
Hengjia Li — Zhejiang University (image generation, video generation)
Fan Zhuo — Zhejiang University
Hankun Xu — Zhejiang University
Senlin Cheng — Ant Group
Di Weng — School of Software Technology, Zhejiang University (Visualization, Visual Analytics, Human-Computer Interaction)
Can Ye — Ant Group
Boxi Wu — Zhejiang University