🤖 AI Summary
This paper introduces “script-driven video summarization”: given a user-written natural-language script outlining the desired visual content, the task is to automatically select the most relevant segments of a full-length video to produce a personalized summary. Methodologically, the authors (i) extend the large-scale VideoXum dataset for generic video summarization by writing natural-language descriptions of its human-annotated summaries, yielding video–summary–script triplets suitable for training on the new task; and (ii) propose SD-VSum, a network architecture that uses a cross-modal attention mechanism to align and fuse information from the visual and text modalities. Experiments show that SD-VSum outperforms state-of-the-art query-driven and generic (unimodal and multimodal) video summarization methods, and that it can produce summaries adapted to each user's stated content needs.
📝 Abstract
In this work, we introduce the task of script-driven video summarization, which aims to produce a summary of a full-length video by selecting the parts that are most relevant to a user-provided script outlining the visual content of the desired summary. Subsequently, we extend a recently introduced large-scale dataset for generic video summarization (VideoXum) by producing natural-language descriptions of the different human-annotated summaries that are available per video. In this way we make the dataset compatible with the introduced task, since the resulting triplets of “video, summary, and summary description” can be used to train a method that produces different summaries for a given video, each driven by a script describing that summary's content. Finally, we develop a new network architecture for script-driven video summarization (SD-VSum), which relies on a cross-modal attention mechanism to align and fuse information from the visual and text modalities. Our experimental evaluations demonstrate the advanced performance of SD-VSum against state-of-the-art approaches for query-driven and generic (unimodal and multimodal) summarization from the literature, and document its capacity to produce video summaries that are adapted to each user's needs regarding their content.
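The abstract describes the cross-modal attention mechanism only at a high level. As a rough illustration of the general idea (not the paper's actual implementation; all names and shapes here are assumptions), a single-head scaled dot-product cross-attention in which script-sentence embeddings act as queries over video frame embeddings could be sketched as:

```python
import numpy as np

def cross_modal_attention(text_feats, video_feats):
    """Hypothetical sketch: text queries attend over video frames.

    text_feats:  (T, d) script/sentence embeddings (queries)
    video_feats: (F, d) frame embeddings (keys and values)
    Returns text-conditioned video features (T, d) and
    attention weights (T, F) over the frames.
    """
    d = text_feats.shape[-1]
    # Scaled dot-product similarity between each sentence and each frame
    scores = text_feats @ video_feats.T / np.sqrt(d)      # (T, F)
    # Softmax over frames (subtract max for numerical stability)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Fuse: weighted sum of frame features per script sentence
    attended = weights @ video_feats                      # (T, d)
    return attended, weights
```

Frames receiving high attention from the script sentences would then be natural candidates for inclusion in the summary; the paper's actual fusion and scoring layers are more involved than this sketch.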