🤖 AI Summary
Existing video generation evaluation benchmarks emphasize visual fidelity while neglecting higher-order reasoning capabilities. This work introduces TiViBench—the first hierarchical reasoning benchmark for image-to-video (I2V) generation models—assessing four dimensions: structural understanding, spatial reasoning, symbolic logic, and action planning, across 24 tasks with three difficulty levels. The authors also propose VideoTPO, a training-free test-time optimization method that uses large language models to perform chain-of-frames reasoning over generated videos and conduct self-preference ranking. Experiments reveal that commercial models (e.g., Sora 2, Veo 3.1) exhibit strong reasoning abilities, whereas open-source models are constrained by training scale and data diversity. Applying VideoTPO significantly enhances their reasoning performance, uncovering previously underutilized reasoning potential. TiViBench thus establishes a rigorous, multidimensional framework for evaluating and advancing I2V reasoning, while VideoTPO offers a lightweight, generalizable approach to unlock reasoning capabilities without retraining.
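The summary describes VideoTPO as a training-free loop: sample several candidate videos, have an LLM critique each one, then self-preference-rank and keep the best. A minimal sketch of that control flow is below; the paper does not publish an interface, so `analyze`, `rank`, and every identifier here are hypothetical stubs standing in for the I2V model and LLM calls.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """One sampled video from the frozen I2V model (hypothetical wrapper)."""
    video_id: str
    critique: str = ""
    score: float = 0.0

def analyze(video_id: str) -> str:
    # Stand-in for an LLM that inspects sampled frames and writes a
    # strengths/weaknesses critique; a real system would call an LLM API.
    return f"critique of {video_id}"

def rank(candidates: list[Candidate]) -> Candidate:
    # Stand-in for LLM self-preference ranking over the critiques.
    # Here a placeholder score makes the last candidate win deterministically.
    for i, c in enumerate(candidates):
        c.score = float(i)
    return max(candidates, key=lambda c: c.score)

def video_tpo(prompt: str, n_candidates: int = 4) -> Candidate:
    # 1) Sample N candidate videos (no retraining, no reward model).
    candidates = [Candidate(video_id=f"{prompt}-{k}") for k in range(n_candidates)]
    # 2) LLM self-analysis of each candidate.
    for c in candidates:
        c.critique = analyze(c.video_id)
    # 3) Self-preference ranking; return the preferred candidate.
    return rank(candidates)
```

Because everything happens at inference time, this wraps any I2V model as a black box; only the number of candidates trades compute for quality.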
📝 Abstract
The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3's chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning & Search, ii) Spatial & Visual Pattern Reasoning, iii) Symbolic & Logical Reasoning, and iv) Action Planning & Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open-source models show untapped capability that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, setting a foundation for future research in this emerging field.
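The abstract fixes the benchmark's shape: four reasoning dimensions, 24 task scenarios, three difficulty levels. A small sketch of how per-task results could roll up into per-dimension scores is below; the four dimension names come from the abstract, but the aggregation scheme and all sample scores are illustrative assumptions, not the paper's official protocol.

```python
# Dimension names are taken verbatim from the TiViBench abstract.
DIMENSIONS = [
    "Structural Reasoning & Search",
    "Spatial & Visual Pattern Reasoning",
    "Symbolic & Logical Reasoning",
    "Action Planning & Task Execution",
]
LEVELS = ["easy", "medium", "hard"]  # three difficulty levels (names assumed)

def aggregate(scores: dict[tuple[str, str], float]) -> dict[str, float]:
    """Average per-(dimension, level) pass rates into one score per dimension.

    `scores` maps (dimension, level) -> pass rate in [0, 1]; missing cells
    are simply skipped, and a dimension with no results scores 0.0.
    """
    out: dict[str, float] = {}
    for dim in DIMENSIONS:
        vals = [v for (d, _lvl), v in scores.items() if d == dim]
        out[dim] = sum(vals) / len(vals) if vals else 0.0
    return out
```

Averaging uniformly over levels is one design choice among several; a benchmark could instead weight harder levels more, which the paper may or may not do.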