🤖 AI Summary
Current vision-language models (VLMs) show limited capability in holistic narrative understanding of long videos (50–170 minutes), partly because of benchmark deficiencies: an overemphasis on fine-grained fact retrieval, or a reliance on low-quality, model-generated questions, neither of which assesses higher-order reasoning such as causal inference, temporal modeling, and motivation understanding. To address this, we introduce MF², a benchmark for narrative understanding of full-length, openly licensed films. MF² comprises over 850 manually constructed claim pairs, each pairing a true claim (fact) with a plausible but false one (fib), targeting character motivations, event causality, and temporal order, and referring to memorable, story-level moments rather than peripheral details. Instead of multiple-choice questions, MF² uses a binary judgment protocol in which a model must correctly classify both claims in a pair, reducing answer-ordering biases. Experiments show that state-of-the-art open-weight and closed-source VLMs fall well short of human performance on MF², confirming that deep narrative modeling for long-form video remains an unsolved challenge.
📝 Abstract
Despite recent progress in vision-language models (VLMs), holistic understanding of long-form video content remains a significant challenge, partly due to limitations in current benchmarks. Many focus on peripheral, "needle-in-a-haystack" details, encouraging context-insensitive retrieval over deep comprehension. Others rely on large-scale, semi-automatically generated questions (often produced by language models themselves) that are easier for models to answer but fail to reflect genuine understanding. In this paper, we introduce MF², a new benchmark for evaluating whether models can comprehend, consolidate, and recall key narrative information from full-length movies (50–170 minutes long). MF² includes over 50 full-length, open-licensed movies, each paired with manually constructed sets of claim pairs, one true (fact) and one plausible but false (fib), totalling over 850 pairs. These claims target core narrative elements such as character motivations and emotions, causal chains, and event order, and refer to memorable moments that humans can recall without rewatching the movie. Instead of multiple-choice formats, we adopt a binary claim evaluation protocol: for each pair, models must correctly identify both the true and the false claim. This reduces biases like answer ordering and enables a more precise assessment of reasoning. Our experiments demonstrate that both open-weight and closed state-of-the-art models fall well short of human performance, underscoring the relative ease of the task for humans and their superior ability to retain and reason over critical narrative information, an ability current VLMs lack.
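To make the pair-level protocol concrete, here is a minimal Python sketch of the scoring rule described above (the data structure, field names, and `judge` interface are hypothetical illustrations, not the paper's released evaluation code): a pair counts as correct only when the model accepts the fact and rejects the fib, which is stricter than claim-level accuracy and insensitive to answer ordering.

```python
# Sketch of pair-level scoring: a pair is correct only if the model
# labels the fact "true" AND the fib "false". Names are hypothetical.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ClaimPair:
    fact: str  # true claim about the movie
    fib: str   # plausible but false claim


def pair_accuracy(pairs: List[ClaimPair],
                  judge: Callable[[str], bool]) -> float:
    """judge(claim) returns True if the model deems the claim true."""
    correct = sum(1 for p in pairs if judge(p.fact) and not judge(p.fib))
    return correct / len(pairs)


if __name__ == "__main__":
    pairs = [ClaimPair(fact="The heroine leaves town to find her brother.",
                       fib="The heroine leaves town to escape a debt.")]
    # A degenerate judge that accepts every claim gets 50% per-claim
    # accuracy here, but 0.0 pair accuracy, since it never rejects a fib.
    print(pair_accuracy(pairs, judge=lambda claim: True))  # 0.0
```

The degenerate judge in the usage example illustrates why pair-level scoring is informative: a model that answers "true" indiscriminately scores zero, whereas under per-claim accuracy it would appear to get half the claims right.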