VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?

📅 2025-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video understanding benchmarks lack sufficient reasoning depth, and those aimed at video reasoning are predominantly knowledge-driven with weak dependence on visual content, so they cannot show whether extended chain-of-thought (CoT) reasoning helps. Method: We introduce VideoReasonBench, a benchmark explicitly designed for vision-centric, complex video reasoning. It features: (1) a three-level evaluation of escalating difficulty that covers recall of observed visual information, inference of latent-state content, and prediction beyond the video; (2) videos that depict sequences of fine-grained operations on a latent state visible in only part of the video; and (3) an analysis of how the test-time thinking budget affects complex video reasoning. Contribution/Results: Evaluation of 18 state-of-the-art multimodal large language models reveals stark performance gaps: GPT-4o achieves only 6.9% accuracy, while the thinking-enhanced Gemini-2.5-Pro attains 56.0%. Critically, an extended thinking budget is essential for improving performance on VideoReasonBench but brings little or no benefit on existing video benchmarks, indicating that the benchmark is sensitive to both reasoning depth and visual grounding.

📝 Abstract

Recent studies have shown that long chain-of-thought (CoT) reasoning can significantly enhance the performance of large language models (LLMs) on complex tasks. However, this benefit has yet to be demonstrated in the domain of video understanding, since most existing benchmarks lack the reasoning depth required to demonstrate the advantages of extended CoT. While recent efforts have proposed benchmarks aimed at video reasoning, the tasks are often knowledge-driven and do not rely heavily on visual content. To bridge this gap, we introduce VideoReasonBench, a benchmark designed to evaluate vision-centric, complex video reasoning. To ensure visual richness and high reasoning complexity, each video in VideoReasonBench depicts a sequence of fine-grained operations on a latent state that is only visible in part of the video. The questions evaluate three escalating levels of video reasoning skills: recalling observed visual information, inferring the content of latent states, and predicting information beyond the video. Under this task setting, models have to precisely recall multiple operations in the video and perform step-by-step reasoning to reach the correct final answers. Using VideoReasonBench, we comprehensively evaluate 18 state-of-the-art multimodal LLMs (MLLMs), finding that most perform poorly on complex video reasoning: GPT-4o achieves only 6.9% accuracy, while the thinking-enhanced Gemini-2.5-Pro significantly outperforms the others with 56.0% accuracy. Our investigations on "test-time scaling" further reveal that an extended thinking budget, while offering little or no benefit on existing video benchmarks, is essential for improving performance on VideoReasonBench.
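The three question levels described in the abstract can be illustrated with a toy latent-state task. The following is a minimal sketch under stated assumptions, not the benchmark's actual data or generator: the `Swap` operation, the three-cup state, and the choice to reveal the state only at the end are all hypothetical stand-ins for "fine-grained operations on a latent state that is only visible in part of the video".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Swap:
    """Hypothetical operation: swap the contents of two positions."""
    i: int
    j: int


def apply_ops(state, ops):
    """Return a new state after applying each swap in order."""
    state = list(state)
    for op in ops:
        state[op.i], state[op.j] = state[op.j], state[op.i]
    return state


def invert_ops(final_state, ops):
    """Recover the starting state by undoing the swaps in reverse order.

    A swap is its own inverse, so replaying the reversed sequence suffices.
    """
    return apply_ops(final_state, reversed(ops))


# Toy episode (assumption): the state is hidden while the operations
# are shown and is revealed only at the end of the "video".
initial = ["coin", "empty", "empty"]
observed_ops = [Swap(0, 1), Swap(1, 2), Swap(0, 1)]
revealed_final = apply_ops(initial, observed_ops)

# Level 1, recall: report what was directly observed.
recall_answer = len(observed_ops)

# Level 2, inference: reconstruct the hidden initial state.
inferred_initial = invert_ops(revealed_final, observed_ops)

# Level 3, prediction: apply a further operation not shown in the video.
predicted = apply_ops(revealed_final, [Swap(1, 2)])
```

Even in this toy form, a single misremembered operation changes the level 2 and level 3 answers, which is consistent with the abstract's point that models must recall the operations precisely and then reason step by step.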

Problem

Research questions and friction points this paper is trying to address.

Evaluating vision-centric complex video reasoning in MLLMs

Assessing recall, inference, and prediction skills in video understanding

Benchmarking MLLMs on fine-grained visual state reasoning tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces VideoReasonBench for vision-centric video reasoning

Evaluates three escalating skills: recalling observed visual information, inferring latent states, and predicting information beyond the video

Shows that an extended test-time thinking budget is essential on VideoReasonBench while offering little or no benefit on existing video benchmarks

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs