SD-MVSum: Script-Driven Multimodal Video Summarization Method and Datasets

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the insufficient joint modeling of the visual, audio, and textual modalities in script-driven video summarization. We propose a weighted cross-modal attention mechanism that explicitly models fine-grained semantic alignment between user-provided scripts and video frames, as well as between scripts and ASR transcripts. Our method fuses visual, acoustic, and textual features and leverages semantic similarity to guide key-shot selection for multimodal joint summary generation. To support this task, we construct and publicly release two large-scale script–video–transcript triplet datasets. Extensive experiments demonstrate that our approach significantly outperforms both state-of-the-art script-driven and generic video summarization methods across multiple benchmarks, achieving an average +4.2% F1-score improvement. The source code and datasets are made publicly available.
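The weighted cross-modal attention idea described above can be illustrated with a minimal sketch: standard scaled dot-product attention from script embeddings over frame (or transcript) embeddings, with each attention score additionally boosted by the cosine similarity between the paired modalities. This is an illustrative approximation, not the paper's exact formulation; the embeddings, the `1 + similarity` weighting, and the function names are assumptions for demonstration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def weighted_cross_attention(script_emb, frame_emb):
    """Sketch of weighted cross-modal attention: each script embedding
    attends over frame embeddings, with attention logits re-weighted by
    semantic (cosine) similarity so that script-relevant shots are promoted.
    The (1 + similarity) weighting is a hypothetical choice for illustration."""
    d = len(frame_emb[0])
    outputs = []
    for q in script_emb:
        # Scaled dot-product attention logits.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in frame_emb]
        # Semantic-similarity weights between the paired modalities.
        sims = [cosine(q, k) for k in frame_emb]
        attn = softmax([s * (1.0 + w) for s, w in zip(scores, sims)])
        # Attention-weighted sum of frame embeddings.
        outputs.append([sum(a * v[i] for a, v in zip(attn, frame_emb)) for i in range(d)])
    return outputs
```

In SD-MVSum this kind of weighting is applied to both the script-video and the script-transcript pair, and the resulting representations are fused before key-shot selection.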

📝 Abstract
In this work, we extend a recent method for script-driven video summarization, which originally considered only the visual content of the video, to also take into account the relevance of the user-provided script to the video's spoken content. In the proposed method, SD-MVSum, the dependence between each considered pair of data modalities, i.e., script-video and script-transcript, is modeled using a new weighted cross-modal attention mechanism. This mechanism explicitly exploits the semantic similarity between the paired modalities in order to promote the parts of the full-length video with the highest relevance to the user-provided script. Furthermore, we extend two large-scale video summarization datasets (S-VideoXum, MrHiSum) to make them suitable for training and evaluating script-driven multimodal video summarization methods. Experimental comparisons document the competitiveness of our SD-MVSum method against other SOTA approaches for script-driven and generic video summarization. Our new method and extended datasets are available at: https://github.com/IDT-ITI/SD-MVSum.
Problem

Research questions and friction points this paper is trying to address.

Extending script-driven video summarization to include spoken content
Modeling cross-modal dependencies between script, video, and transcript
Creating extended datasets for multimodal video summarization training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weighted cross-modal attention models script-video and script-transcript dependencies
Exploits semantic similarity to prioritize script-relevant video segments
Extends datasets for multimodal script-driven video summarization training