Movie Facts and Fibs (MF²): A Benchmark for Long Movie Understanding

📅 2025-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) show limited holistic narrative understanding of long videos (50–170 minutes), partly because existing benchmarks either overemphasize fine-grained fact retrieval or rely on low-quality, semi-automatically generated questions, and thus fail to assess higher-order reasoning such as causal inference, temporal modeling, and motivation understanding. To address this, the authors introduce MF², a narrative understanding benchmark built on full-length films. MF² comprises over 850 human-authored claim pairs, each consisting of one true claim (fact) and one plausible but false claim (fib), targeting character motivation, event causality, and temporal order. It adopts a binary judgment protocol to avoid multiple-choice biases and is constructed exclusively from openly licensed films, prioritizing story-level comprehension of moments humans can recall without rewatching. Experiments show that state-of-the-art open- and closed-weight VLMs fall well short of human performance on MF², confirming that deep narrative modeling for long-form video remains an open challenge.

📝 Abstract
Despite recent progress in vision-language models (VLMs), holistic understanding of long-form video content remains a significant challenge, partly due to limitations in current benchmarks. Many focus on peripheral, "needle-in-a-haystack" details, encouraging context-insensitive retrieval over deep comprehension. Others rely on large-scale, semi-automatically generated questions (often produced by language models themselves) that are easier for models to answer but fail to reflect genuine understanding. In this paper, we introduce MF², a new benchmark for evaluating whether models can comprehend, consolidate, and recall key narrative information from full-length movies (50–170 minutes long). MF² includes over 50 full-length, open-licensed movies, each paired with manually constructed sets of claim pairs, one true (fact) and one plausible but false (fib), totalling over 850 pairs. These claims target core narrative elements such as character motivations and emotions, causal chains, and event order, and refer to memorable moments that humans can recall without rewatching the movie. Instead of multiple-choice formats, we adopt a binary claim evaluation protocol: for each pair, models must correctly identify both the true and false claims. This reduces biases like answer ordering and enables a more precise assessment of reasoning. Our experiments demonstrate that both open-weight and closed state-of-the-art models fall well short of human performance, underscoring the relative ease of the task for humans and their superior ability to retain and reason over critical narrative information, an ability current VLMs lack.
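The pair-level protocol described in the abstract can be sketched in a few lines: a model gets credit for a pair only when it accepts the fact and rejects the fib. This is a minimal illustrative sketch, not code from the paper; the function name, the `judge` callable, and the toy claims are all hypothetical.

```python
def pair_accuracy(pairs, judge):
    """Fraction of (fact, fib) pairs where the judge accepts the fact
    and rejects the fib. `pairs` is a list of (fact, fib) claim strings;
    `judge` is any callable mapping a claim string to True/False."""
    solved = sum(1 for fact, fib in pairs if judge(fact) and not judge(fib))
    return solved / len(pairs)

# Toy usage with a trivial keyword-based "judge" (purely illustrative):
pairs = [
    ("The hero leaves the city to protect his family.",
     "The hero leaves the city to chase a treasure."),
]
judge = lambda claim: "protect" in claim
print(pair_accuracy(pairs, judge))  # 1.0 on this toy pair
```

Note that a judge answering "true" to every claim scores 0 under this metric, which is why the pairwise setup is harder to game than per-claim accuracy or multiple choice.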
Problem

Research questions and friction points this paper is trying to address.

Evaluating model comprehension of long movie narratives
Addressing limitations in current video understanding benchmarks
Assessing recall of key narrative elements without rewatching
Innovation

Methods, ideas, or system contributions that make the work stand out.

Binary claim evaluation protocol for movies
Manually constructed true and false claims
Focus on core narrative elements understanding
👥 Authors

Emmanouil Zaranis
Instituto Superior Técnico, Universidade de Lisboa; Instituto de Telecomunicações

António Farinhas
Sword Health
Machine Learning · Natural Language Processing

Saul Santos
Instituto Superior Técnico, Universidade de Lisboa; Instituto de Telecomunicações

Beatriz Canaverde
Instituto Superior Técnico, Universidade de Lisboa; Instituto de Telecomunicações

Miguel Moura Ramos
Instituto Superior Técnico, Universidade de Lisboa; Instituto de Telecomunicações

Aditya K Surikuchi
UNC Chapel Hill

André Viveiros
Instituto Superior Técnico, Universidade de Lisboa; Instituto de Telecomunicações

Baohao Liao
PhD at the Language Technology Lab, University of Amsterdam
Agent · Reasoning · Efficiency

Elena Bueno-Benito
Institut de Robòtica i Informàtica Industrial, CSIC-UPC

Nithin Sivakumaran
UNC Chapel Hill

Pavlo Vasylenko
Instituto Superior Técnico, Universidade de Lisboa; Instituto de Telecomunicações

Shoubin Yu
PhD Candidate at UNC Chapel Hill
Multimodal AI · Machine Learning · Computer Vision · Video Understanding

Sonal Sannigrahi
Instituto Superior Técnico, Universidade de Lisboa; Instituto de Telecomunicações

Wafaa Mohammed
Language Technology Lab, University of Amsterdam

Ben Peters
University of Copenhagen

Danae Sánchez Villegas
University of Copenhagen
Natural Language Processing · Multimodal Learning · Machine Learning · Computational Social Science

Elias Stengel-Eskin
Assistant Professor, University of Texas at Austin
Natural Language Processing · Computational Semantics · Computational Linguistics

Giuseppe Attanasio
Postdoctoral Researcher, Instituto de Telecomunicações
AI · Fairness · Transparency · Safety

Jaehong Yoon
UNC Chapel Hill

Stella Frank
University of Copenhagen; Pioneer Center for AI

Alessandro Suglia
Assistant Professor, Heriot-Watt University, Edinburgh Centre for Robotics, National Robotarium
Multimodal Generative AI · Embodied AI · Conversational AI

Chrysoula Zerva
Instituto Superior Técnico, Universidade de Lisboa; ELLIS Unit Lisbon

Desmond Elliott
Associate Professor, University of Copenhagen
Natural Language Processing · Vision-Language · Tokenization-free Language Models

Mariella Dimiccoli
Institute of Robotics and Industrial Informatics (CSIC-UPC)
Artificial Intelligence · Machine Learning · Computer Vision · Pattern Recognition · Multimedia

Mohit Bansal
Parker Distinguished Professor, Computer Science, UNC Chapel Hill
Natural Language Processing · Computer Vision · Machine Learning · Multimodal AI

Oswald Lanz
Free University of Bozen-Bolzano
Computer Vision · Deep Learning · Video Understanding

Raffaella Bernardi
Free University of Bozen-Bolzano
Language Grounding to Vision · Dialogues · Reasoning · Syntax-Semantics Interface

Raquel Fernández
Institute for Logic, Language and Computation, University of Amsterdam
Dialogue & Pragmatics · Conversational AI · Natural Language Processing · Computational Linguistics · Cognitive Science

Sandro Pezzelle
Assistant Professor at ILLC, University of Amsterdam
Natural Language Processing · Multimodal Machine Learning · AI · Cognitive Science

Vlad Niculae
University of Amsterdam
Structured Prediction · Natural Language Processing · Machine Learning

André F. T. Martins
Instituto Superior Técnico, Universidade de Lisboa; Instituto de Telecomunicações; Unbabel; ELLIS Unit Lisbon