🤖 AI Summary
Existing video question answering (VQA) datasets predominantly emphasize surface-level comprehension and fail to support deep cognitive reasoning about cinematic content. To address this limitation, we introduce the first high-quality VQA dataset explicitly designed for deep cinematic understanding. Our method comprises three core components: (1) a multi-agent brainstorming framework in which large language models serve as collaborative cognitive agents to generate structured, semantically rich question-answer pairs; (2) a cognitively grounded, quantifiable evaluation framework that systematically measures question depth and pedagogical value along defined cognitive dimensions; and (3) an Agentic Choice Enhancement (ACE) module that refines model reasoning paths post-training. Experimental results demonstrate that ACE improves VQA model performance on deep-reasoning benchmarks by up to 25%, substantially outperforming state-of-the-art approaches. This work establishes a new benchmark and methodological paradigm for fine-grained semantic understanding of film.
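The multi-agent brainstorming component can be pictured as a propose-critique-select loop. The sketch below is purely illustrative: the agent behavior, scoring heuristic, and function names (`propose`, `critique`, `brainstorm`) are assumptions standing in for real LLM calls, not the paper's actual implementation.

```python
# Minimal sketch of a multi-agent QA brainstorming loop (illustrative only).
# Real "thought agents" would be LLM calls; here they are simple stand-ins.
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    depth_score: float  # cognitive-depth rating assigned by the critic step

def propose(agent_id: int, clip_summary: str) -> QAPair:
    """Stand-in for an LLM 'thought agent' proposing a question-answer pair."""
    question = f"[agent {agent_id}] Why does the character act this way in: {clip_summary}?"
    answer = f"[agent {agent_id}] A System-2 explanation grounded in the clip."
    return QAPair(question, answer, depth_score=0.0)

def critique(pair: QAPair) -> float:
    """Stand-in critic: treat 'why'-style, longer questions as a proxy for depth."""
    score = 1.0 if "why" in pair.question.lower() else 0.5
    return score + min(len(pair.question) / 200, 1.0)

def brainstorm(clip_summary: str, n_agents: int = 3) -> QAPair:
    """Each agent proposes a pair; the critic scores them; the deepest survives."""
    candidates = [propose(i, clip_summary) for i in range(n_agents)]
    for pair in candidates:
        pair.depth_score = critique(pair)
    return max(candidates, key=lambda p: p.depth_score)

best = brainstorm("a detective hides evidence from his partner")
print(best.question)
```

In the actual framework, the critique step would itself be an LLM agent applying the paper's cognitive evaluation criteria rather than a hand-written heuristic.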
📝 Abstract
This paper introduces MovieCORE, a novel video question answering (VQA) dataset designed to probe deeper cognitive understanding of movie content. Unlike existing datasets that focus on surface-level comprehension, MovieCORE emphasizes questions that engage System-2 thinking while remaining specific to the video material. We present an innovative agentic brainstorming approach, utilizing multiple large language models (LLMs) as thought agents to generate and refine high-quality question-answer pairs. To evaluate dataset quality, we develop a set of cognitive tests assessing depth, thought-provocation potential, and syntactic complexity. We also propose a comprehensive evaluation scheme for assessing VQA model performance on deeper cognitive tasks. To address the limitations of existing video-language models (VLMs), we introduce an agentic enhancement module, Agentic Choice Enhancement (ACE), which improves model reasoning capabilities post-training by up to 25%. Our work contributes to advancing movie understanding in AI systems and provides valuable insights into the capabilities and limitations of current VQA models when faced with more challenging, nuanced questions about cinematic content. Our project page, dataset, and code can be found at https://joslefaure.github.io/assets/html/moviecore.html.