🤖 AI Summary
Existing multimodal large language models (MLLMs) are evaluated on video understanding tasks using isolated frames or single videos, failing to capture continuous, narrative-driven sequences prevalent in real-world scenarios. Method: We introduce SeriesBench, the first multi-task benchmark for episode-level narrative understanding, comprising 105 TV episodes and 28 fine-grained narrative tasks. It features a novel long-horizon narrative annotation scheme and a full-information task auto-conversion mechanism. We further propose PC-DCoT, a reasoning framework that explicitly models plot-level causal chains and dynamic character interactions. Contribution/Results: Experiments reveal significant bottlenecks in current MLLMs’ episode-level narrative comprehension. PC-DCoT boosts average accuracy of mainstream models on SeriesBench by 19.7%. The benchmark is publicly released, and the paper was accepted at CVPR 2025.
📝 Abstract
With the rapid development of Multi-modal Large Language Models (MLLMs), an increasing number of benchmarks have been established to evaluate the video understanding capabilities of these models. However, these benchmarks focus on standalone videos and mainly assess "visual elements" like human actions and object states. In reality, contemporary videos often encompass complex and continuous narratives, typically presented as a series. To address this challenge, we propose SeriesBench, a benchmark consisting of 105 carefully curated narrative-driven series, covering 28 specialized tasks that require deep narrative understanding. Specifically, we first select a diverse set of drama series spanning various genres. Then, we introduce a novel long-span narrative annotation method, combined with a full-information transformation approach to convert manual annotations into diverse task formats. To further enhance model capacity for detailed analysis of plot structures and character relationships within series, we propose a novel narrative reasoning framework, PC-DCoT. Extensive results on SeriesBench indicate that existing MLLMs still face significant challenges in understanding narrative-driven series, while PC-DCoT enables these MLLMs to achieve performance improvements. Overall, our SeriesBench and PC-DCoT highlight the critical necessity of advancing model capabilities to understand narrative-driven series, guiding the future development of MLLMs. SeriesBench is publicly available at https://github.com/zackhxn/SeriesBench-CVPR2025.