SF20K Competition 2025: Summary and findings

📅 2026-05-02

🤖 AI Summary

This report covers story-level video understanding in amateur short films, moving beyond conventional short-clip action recognition. It summarizes the first SF20K Competition, an open-ended video question-answering challenge evaluated on the SF20K-Test benchmark (95 short films, 979 question-answer pairs) in two tracks: a main track with unrestricted model size and a special track restricted to models under 8B parameters. Answers are scored by LLM-QA-Eval, an automated judge based on GPT-4.1-nano. The analysis finds that narrative-aware, shot-level processing outperforms uniform frame sampling, that subtitle quality is a dominant factor in performance, and that well-designed multi-stage pipelines built on smaller models can match or exceed end-to-end inference by much larger models. The winning team reached 65.7% accuracy in the main track and 48.7% in the special track, well below the human performance ceiling of 91.7%, which points to a significant remaining gap in current models’ narrative comprehension.
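To make the reported numbers concrete, the sketch below computes accuracy from per-question judge verdicts and the gap to the human ceiling in percentage points. It is an illustration only: the boolean verdict format and the function names `accuracy` and `gap_to_human` are assumptions, since the actual output format of LLM-QA-Eval is not described here.

```python
from typing import Sequence

HUMAN_CEILING = 91.7  # human accuracy (%) quoted in the summary


def accuracy(verdicts: Sequence[bool]) -> float:
    """Percentage of answers judged correct.

    Assumption: the judge yields one boolean verdict per question.
    """
    if not verdicts:
        raise ValueError("no verdicts to score")
    return 100.0 * sum(verdicts) / len(verdicts)


def gap_to_human(model_acc: float, human_acc: float = HUMAN_CEILING) -> float:
    """Gap to the human ceiling in percentage points, rounded to one decimal."""
    return round(human_acc - model_acc, 1)


if __name__ == "__main__":
    print(gap_to_human(65.7))  # main track winner
    print(gap_to_human(48.7))  # special track winner
```

With the figures quoted above, the main-track winner sits 26.0 points below the human ceiling and the special-track winner 43.0 points below it.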

📝 Abstract

This report presents the results and findings of the first edition of the Short-Films 20K (SF20K) Competition, held in conjunction with the SLoMO Workshop at ICCV 2025. The competition is designed to advance story-level video understanding beyond short-clip action recognition, introducing an open-ended video question-answering task built on a corpus of amateur short films. This setup ensures that models must rely on multimodal understanding rather than memorization of popular movies. Evaluation is conducted using the SF20K-Test benchmark (95 movies, 979 question-answer pairs) and scored via LLM-QA-Eval, an automated judge based on GPT-4.1-nano. The competition attracted 22 teams and 286 submissions across two tracks: a Main Track with unrestricted model size and a Special Track limited to models under 8 billion parameters. The winning team achieved 65.7% accuracy on the Main Track and 48.7% on the Special Track, against a human performance ceiling of 91.7%. Our analysis reveals several key findings: narrative-aware, shot-level processing consistently outperforms uniform frame sampling; well-designed multi-stage pipelines using smaller models can match or exceed end-to-end inference with models over 30x larger; and subtitle quality is a dominant factor in performance. These results highlight that the primary bottleneck in long-form video QA lies in information selection and reasoning structure rather than raw model capacity, and that a substantial gap remains between current methods and human-level narrative comprehension.
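The contrast between uniform frame sampling and shot-level processing can be illustrated with a toy sketch. This is not any participant's actual pipeline: it assumes shot boundaries are already available as `(start, end)` frame ranges (end exclusive), and the helper names `uniform_sample`, `shot_level_sample`, and `shots_covered` are hypothetical.

```python
from typing import List, Tuple

Shot = Tuple[int, int]  # (start_frame, end_frame), end exclusive


def uniform_sample(num_frames: int, k: int) -> List[int]:
    """Pick k frame indices evenly spaced over the whole video."""
    if k <= 0 or num_frames <= 0:
        return []
    k = min(k, num_frames)
    step = num_frames / k
    # Take the midpoint of each of the k equal-width segments.
    return [int(step * i + step / 2) for i in range(k)]


def shot_level_sample(shots: List[Shot]) -> List[int]:
    """Pick the middle frame of every non-empty shot."""
    return [(start + end - 1) // 2 for start, end in shots if end > start]


def shots_covered(frames: List[int], shots: List[Shot]) -> int:
    """Count how many shots contain at least one sampled frame."""
    return sum(any(s <= f < e for f in frames) for s, e in shots)


if __name__ == "__main__":
    # Two very short shots around one long shot (assumed boundaries).
    shots = [(0, 5), (5, 95), (95, 100)]
    uniform = uniform_sample(100, 4)
    per_shot = shot_level_sample(shots)
    print(uniform, shots_covered(uniform, shots))
    print(per_shot, shots_covered(per_shot, shots))
```

In this toy case all four uniformly sampled frames land in the long middle shot, while one frame per shot covers all three shots with fewer frames. That is one plausible reason selection strategy can matter more than raw model capacity, in line with the report's findings.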

Problem

Research questions and friction points this paper is trying to address.

video question answering

story-level understanding

long-form video

narrative comprehension

multimodal reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

narrative-aware video understanding

open-ended video QA

multimodal reasoning