REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing

📅 2025-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video summarization methods face a fundamental trade-off between narrative coherence and faithful reference to original video segments: extractive approaches lack semantic continuity, while abstractive methods cannot precisely embed verbatim video clips. This paper proposes a retrieval-augmented generation paradigm that enables large language models (LLMs) to dynamically reserve “citation slots” during script generation—slots subsequently filled by a multi-granularity cross-modal retrieval module that accurately locates and inserts corresponding segments from long videos. Our approach comprises three components: (1) fine-tuning an LLM to generate coherent scripts with placeholder tokens; (2) building a fine-grained video–text alignment retrieval module; and (3) introducing a document-style teaser evaluation protocol. Objective metrics demonstrate significant improvements in both segment localization accuracy and narrative consistency. Subjective evaluations show consistent superiority over state-of-the-art extractive and abstractive methods across coherence, semantic alignment, and visual fidelity.
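The two-stage pipeline described above can be sketched in a few lines: a script generator reserves quote placeholders, and a retrieval step fills each slot with the candidate clip whose transcript best matches the surrounding narration. This is a minimal illustrative sketch, not the paper's implementation — the placeholder format, the word-overlap scorer, and all function and field names here are assumptions standing in for the finetuned LLM and the multi-granularity cross-modal retrieval module.

```python
import re

def generate_script():
    # Stand-in for the finetuned LLM: narration text with reserved quote slots.
    return ("The town faced a difficult winter. [QUOTE_1] "
            "Rebuilding began in spring. [QUOTE_2]")

def score(context, transcript):
    # Toy retrieval score: Jaccard word overlap between the narration
    # context and a candidate clip's transcript (illustrative only).
    a = set(re.findall(r"\w+", context.lower()))
    b = set(re.findall(r"\w+", transcript.lower()))
    return len(a & b) / max(1, len(a | b))

def fill_quotes(script, candidates):
    # Replace each placeholder with the best-scoring quotable clip,
    # using the narration just before the slot as retrieval context.
    def best_clip(match):
        start = max(0, match.start() - 60)
        context = script[start:match.start()]
        clip = max(candidates, key=lambda c: score(context, c["transcript"]))
        return f'<clip:{clip["id"]}>'
    return re.sub(r"\[QUOTE_\d+\]", best_clip, script)
```

In the real system, the scorer would be a learned video–text alignment model ranking candidate interview clips, but the control flow (generate with slots, then retrieve per slot) matches the framework the summary describes.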

📝 Abstract
Short videos are an effective tool for promoting content and improving knowledge accessibility. While existing extractive video summarization methods struggle to produce a coherent narrative, existing abstractive methods cannot "quote" from the input videos, i.e., insert short video clips in their outputs. In this work, we explore novel video editing models for generating shorts that feature a coherent narrative with embedded video insertions extracted from a long input video. We propose a novel retrieval-embedded generation framework that allows a large language model to quote multimodal resources while maintaining a coherent narrative. Our proposed REGen system first generates the output story script with quote placeholders using a finetuned large language model, and then uses a novel retrieval model to replace the quote placeholders by selecting a video clip that best supports the narrative from a pool of candidate quotable video clips. We examine the proposed method on the task of documentary teaser generation, where short interview insertions are commonly used to support the narrative of a documentary. Our objective evaluations show that the proposed method can effectively insert short video clips while maintaining a coherent narrative. In a subjective survey, we show that our proposed method outperforms existing abstractive and extractive approaches in terms of coherence, alignment, and realism in teaser generation.
Problem

Research questions and friction points this paper is trying to address.

Generating coherent short videos with embedded clips from long videos
Enabling multimodal quote insertion while maintaining narrative coherence
Improving documentary teaser quality through retrieval-embedded generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-embedded generation framework for video editing
Finetuned large language model for script generation
Novel retrieval model for selecting video clips that best support the narrative