🤖 AI Summary
Existing video summarization methods face a fundamental trade-off between narrative coherence and faithful reference to the original video: extractive approaches lack semantic continuity, while abstractive methods cannot precisely embed verbatim video clips. This paper proposes a retrieval-augmented generation paradigm in which a large language model (LLM) dynamically reserves "citation slots" during script generation; these slots are subsequently filled by a multi-granularity cross-modal retrieval module that locates and inserts the corresponding segments from the long input video. The approach comprises three components: (1) fine-tuning an LLM to generate coherent scripts with placeholder tokens; (2) building a fine-grained video–text alignment retrieval module; and (3) introducing a documentary-teaser evaluation protocol. Objective metrics show significant improvements in both segment localization accuracy and narrative consistency, and subjective evaluations show consistent gains over state-of-the-art extractive and abstractive methods in coherence, semantic alignment, and visual fidelity.
📝 Abstract
Short videos are an effective tool for promoting content and improving knowledge accessibility. While existing extractive video summarization methods struggle to produce a coherent narrative, existing abstractive methods cannot "quote" from the input videos, i.e., insert short video clips into their outputs. In this work, we explore novel video editing models for generating shorts that feature a coherent narrative with embedded video insertions extracted from a long input video. We propose a novel retrieval-embedded generation framework that allows a large language model to quote multimodal resources while maintaining a coherent narrative. Our proposed REGen system first generates the output story script with quote placeholders using a finetuned large language model, and then uses a novel retrieval model to replace each quote placeholder with the video clip that best supports the narrative, selected from a pool of candidate quotable video clips. We examine the proposed method on the task of documentary teaser generation, where short interview insertions are commonly used to support the narrative of a documentary. Our objective evaluations show that the proposed method can effectively insert short video clips while maintaining a coherent narrative. In a subjective survey, our proposed method outperforms existing abstractive and extractive approaches in terms of coherence, alignment, and realism in teaser generation.
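The two-stage pipeline described above can be illustrated with a minimal sketch. Everything here is hypothetical: stage 1 (the finetuned LLM) is represented by a fixed script containing a `<QUOTE>` placeholder, and the paper's cross-modal retrieval model is stood in for by a simple bag-of-words cosine similarity between the narration surrounding each placeholder and each candidate clip's transcript. The function names, the `<QUOTE>` token, and the clip data structure are all illustrative, not the authors' implementation.

```python
# Hypothetical sketch of a retrieval-embedded generation pipeline:
# a script with quote placeholders is filled by retrieving, for each
# placeholder, the candidate clip whose transcript best matches the
# surrounding narration (cosine similarity over word counts stands in
# for the paper's learned retrieval model).
import math
from collections import Counter

QUOTE = "<QUOTE>"  # illustrative placeholder token

def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    num = sum(va[w] * vb[w] for w in va)
    den = (math.sqrt(sum(v * v for v in va.values()))
           * math.sqrt(sum(v * v for v in vb.values())))
    return num / den if den else 0.0

def fill_quotes(script_lines, candidate_clips):
    """Replace each placeholder with the best-matching unused clip."""
    out, used = [], set()
    for i, line in enumerate(script_lines):
        if line != QUOTE:
            out.append(line)
            continue
        # Retrieval context: the narration lines around the placeholder.
        context = " ".join(
            l for l in script_lines[max(0, i - 1):i + 2] if l != QUOTE)
        ranked = sorted(
            (j for j in range(len(candidate_clips)) if j not in used),
            key=lambda j: cosine(context, candidate_clips[j]["transcript"]),
            reverse=True)
        if ranked:
            used.add(ranked[0])
            clip = candidate_clips[ranked[0]]
            out.append(f'[CLIP {clip["id"]}] "{clip["transcript"]}"')
    return out

# Toy example: one placeholder, two candidate interview clips.
script = [
    "The film follows scientists tracking glacier loss.",
    QUOTE,
    "Their findings reshape how we plan for rising seas.",
]
clips = [
    {"id": "A", "transcript": "We watched the glacier retreat year after year."},
    {"id": "B", "transcript": "The festival opened with a musical performance."},
]
print(fill_quotes(script, clips))
```

In the toy example, clip A shares vocabulary with the surrounding narration ("glacier"), so it is selected to fill the slot; in the actual system this text-only matching is replaced by fine-grained video–text retrieval.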