MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction

📅 2026-02-26
🤖 AI Summary
This work addresses the challenges of character identity inconsistency and narrative incoherence commonly observed in long-video (e.g., movie) summarization by current vision-language models. To overcome these limitations without requiring model fine-tuning, the authors propose a tool-augmented, progressive abstraction framework. The approach first leverages off-the-shelf face recognition models to establish factual anchors for character identities and then employs a multi-stage prompting strategy to guide the vision-language model in iteratively compressing and refining the video content. This design effectively circumvents the constraints imposed by limited context windows. Experimental results demonstrate that the proposed method significantly improves character consistency, factual accuracy, and overall narrative coherence in generated summaries, outperforming end-to-end baseline models.

📝 Abstract
With the explosive growth of digital entertainment, automated video summarization has become indispensable for applications such as content indexing, personalized recommendation, and efficient media archiving. Automatic synopsis generation for long-form videos, such as movies and TV series, presents a significant challenge for existing Vision-Language Models (VLMs). While proficient at single-image captioning, these general-purpose models often exhibit critical failures in long-duration contexts, primarily a lack of ID-consistent character identification and fractured narrative coherence. To overcome these limitations, we propose MovieTeller, a novel framework for generating movie synopses via tool-augmented progressive abstraction. Our core contribution is a training-free, tool-augmented, fact-grounded generation process. Instead of requiring costly model fine-tuning, our framework directly leverages off-the-shelf models in a plug-and-play manner. We first invoke a specialized face recognition model as an external "tool" to establish Factual Groundings: precise character identities and their corresponding bounding boxes. These groundings are then injected into the prompt to steer the VLM's reasoning, ensuring the generated scene descriptions are anchored to verifiable facts. Furthermore, our progressive abstraction pipeline decomposes the summarization of a full-length movie into a multi-stage process, effectively mitigating the context length limitations of current VLMs. Experiments demonstrate that our approach yields significant improvements in factual accuracy, character consistency, and overall narrative coherence compared to end-to-end baselines.
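
The two ideas in the abstract, injecting face-recognition results into the prompt and summarizing in stages, can be illustrated with a minimal Python sketch. The abstract does not specify the prompt wording, the number of stages, or how scenes are grouped, so everything here is an assumption made for illustration: the names Grounding, build_grounded_prompt, and progressive_abstraction are hypothetical, the fixed-size grouping is one possible staging scheme, and the summarize callable stands in for a real VLM or LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Grounding:
    """One face-recognition result (hypothetical structure)."""
    name: str                        # recognized character identity
    box: Tuple[int, int, int, int]   # bounding box as (x1, y1, x2, y2)


def build_grounded_prompt(groundings: Sequence[Grounding],
                          instruction: str = "Describe this scene.") -> str:
    """Prepend recognized identities and boxes to the VLM instruction.

    The prompt wording is an assumption, not the paper's template.
    """
    if not groundings:
        return instruction
    facts = "\n".join(f"- {g.name} at box {g.box}" for g in groundings)
    return f"Known characters (from face recognition):\n{facts}\n{instruction}"


def progressive_abstraction(texts: Sequence[str],
                            summarize: Callable[[List[str]], str],
                            group_size: int = 4) -> str:
    """Summarize fixed-size groups repeatedly until one text remains.

    Each call to `summarize` sees at most `group_size` texts, so no single
    call needs the whole movie in its context window.
    """
    if not texts:
        raise ValueError("need at least one text to summarize")
    if group_size < 2:
        raise ValueError("group_size must be at least 2 to make progress")
    level = list(texts)
    while len(level) > 1:
        level = [summarize(level[i:i + group_size])
                 for i in range(0, len(level), group_size)]
    return level[0]


# Usage with a stand-in summarizer (a real system would call a model here).
scene_prompt = build_grounded_prompt([Grounding("Alice", (10, 20, 110, 220))])
synopsis = progressive_abstraction(
    ["scene 1", "scene 2", "scene 3"],
    summarize=lambda chunk: " / ".join(chunk),
    group_size=2,
)
```

Bounding each summarization call to a small group is what keeps every request within a limited context window; the trade-off is that detail is lost at each stage, which is why grounding the lowest-level scene descriptions in verified identities matters.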
Problem

Research questions and friction points this paper is trying to address.

video summarization
character consistency
narrative coherence
vision-language models
long-form video
Innovation

Methods, ideas, or system contributions that make the work stand out.

tool-augmented
ID-consistent
progressive abstraction
factual grounding
vision-language models