🤖 AI Summary
Existing narrative understanding benchmarks are usually confined to individual subtasks, so they cannot measure whether a model builds a coherent narrative world and uses it across different reasoning and generation tasks. To address this gap, this work proposes STAGE, a unified benchmark built on the screenplays of 150 films in English and Chinese. STAGE integrates four core tasks (knowledge graph construction, scene-level event summarization, long-context question answering, and role-playing) within a shared narrative world representation. The benchmark supplies cleaned scripts, curated knowledge graphs, and event- and character-centric annotations, supporting cross-lingual evaluation of world modeling, event abstraction, long-context comprehension, and character-consistent response generation.
📝 Abstract
Movie screenplays are rich long-form narratives that interleave complex character relationships, temporally ordered events, and dialogue-driven interactions. While prior benchmarks target individual subtasks such as question answering or dialogue generation, they rarely evaluate whether models can construct a coherent story world and use it consistently across multiple forms of reasoning and generation. We introduce STAGE (Screenplay Text, Agents, Graphs and Evaluation), a unified benchmark for narrative understanding over full-length movie screenplays. STAGE defines four tasks: knowledge graph construction, scene-level event summarization, long-context screenplay question answering, and in-script character role-playing, all grounded in a shared narrative world representation. The benchmark provides cleaned scripts, curated knowledge graphs, and event- and character-centric annotations for 150 films across English and Chinese, enabling holistic evaluation of models' abilities to build world representations, abstract and verify narrative events, reason over long narratives, and generate character-consistent responses.