Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Can video foundation models serve as zero-shot visual reasoners for complex reasoning tasks? This paper addresses that question by introducing MME-CoF, the first standardized benchmark for video Chain-of-Frame (CoF) reasoning, using Veo-3 as a representative model. MME-CoF spans 12 reasoning dimensions, including spatial, geometric, physical, temporal, and embodied logic. Under a strict zero-shot evaluation protocol, the study shows that current video models exhibit strong short-term dynamic consistency but face significant bottlenecks in long-horizon causal modeling and rigorous abstract reasoning (e.g., geometric constraints and counterfactual inference). By enabling fine-grained, frame-level quantification of video models' reasoning capabilities, MME-CoF establishes both a benchmark and a diagnostic tool to guide future architectural and evaluation work in video understanding.

📝 Abstract
Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet, an important question still remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io
Problem

Research questions and friction points this paper is trying to address.

Evaluating video models' zero-shot reasoning in visual scenarios
Assessing spatial, geometric, physical, and temporal reasoning capabilities
Identifying limitations in long-horizon causal and abstract logic reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates video models using MME-CoF benchmark
Systematically assesses reasoning across 12 dimensions
Combines video models with dedicated reasoning engines