🤖 AI Summary
The zero-shot reasoning capabilities emerging in current generative video models have not yet been evaluated systematically.
Method: We introduce the first multidimensional, verifiable, and reproducible video reasoning benchmark, covering structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. Tasks are built from a mix of synthetic and real-world image sequences so that each has an unambiguous, answer-verifiable definition. We further analyze Chain-of-Frames reasoning, quantifying how video duration (temporal sequence length) affects reasoning performance.
Results: Evaluating six state-of-the-art video generation models reveals clear dimension-wise capability differences and pervasive hallucination patterns, particularly in physical and spatial reasoning. Our benchmark provides an empirically grounded, scalable framework for probing model reasoning mechanisms and advancing human-aligned video understanding.
📝 Abstract
Recent generative video models, such as Veo-3, have shown surprising zero-shot reasoning abilities, creating a growing need for systematic and reliable evaluation. We introduce V-ReasonBench, a benchmark designed to assess video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences, with strong variation in structured, spatial, pattern-based, and physical reasoning. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.
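To make the evaluation protocol concrete, below is a minimal sketch of how an answer-verifiable benchmark of this kind can be scored, including a frame-budget sweep in the spirit of the Chain-of-Frames duration analysis. All names here (`Task`, `generate_video`, `read_answer`, `duration_sweep`) are hypothetical scaffolding for illustration, not the paper's actual API or released code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass(frozen=True)
class Task:
    prompt: str        # task instruction plus initial image sequence
    ground_truth: str  # single unambiguous answer, enabling automatic scoring

def accuracy(tasks: List[Task],
             generate_video: Callable[[str, int], List[str]],
             read_answer: Callable[[List[str]], str],
             num_frames: int) -> float:
    """Generate one video per task, read the answer off the output,
    and score exact-match accuracy against the ground truth."""
    hits = sum(
        read_answer(generate_video(t.prompt, num_frames)) == t.ground_truth
        for t in tasks
    )
    return hits / len(tasks)

def duration_sweep(tasks: List[Task],
                   generate_video: Callable[[str, int], List[str]],
                   read_answer: Callable[[List[str]], str],
                   lengths: Tuple[int, ...] = (16, 32, 64)) -> Dict[int, float]:
    """Re-run the benchmark at several frame budgets to probe how video
    duration affects Chain-of-Frames reasoning accuracy."""
    return {n: accuracy(tasks, generate_video, read_answer, n) for n in lengths}

if __name__ == "__main__":
    # Toy stand-ins: a "video" is a list of frame strings, and the answer
    # is whatever text appears on the final frame (a real harness would
    # use OCR or a structured parser here).
    tasks = [Task("2 + 3 = ?", "5"), Task("next in A B A B ?", "A")]
    fake_model = lambda prompt, n: ["frame"] * (n - 1) + ["5" if "+" in prompt else "A"]
    last_frame = lambda frames: frames[-1]
    print(duration_sweep(tasks, fake_model, last_frame))  # {16: 1.0, 32: 1.0, 64: 1.0}
```

The design point this sketch illustrates is that each task carries a single machine-checkable answer, so scoring requires no human judgment and results remain reproducible and scalable across models and frame budgets.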