🤖 AI Summary
This work addresses the insufficient joint cross-modal modeling of visual, audio, and textual modalities in script-driven video summarization. We propose a weighted cross-modal attention mechanism that explicitly models fine-grained semantic alignment between user-provided scripts and video frames, as well as between scripts and ASR transcripts. Our method fuses visual, acoustic, and textual features and leverages semantic similarity to guide key-shot selection for multimodal joint summary generation. To support this task, we construct and publicly release two large-scale script–video–transcript triplet datasets. Extensive experiments demonstrate that our approach significantly outperforms both state-of-the-art script-driven and generic video summarization methods across multiple benchmarks, achieving an average +4.2% F1-score improvement. The source code and datasets are publicly available.
📝 Abstract
In this work, we extend a recent method for script-driven video summarization, which originally considered only the visual content of the video, so that it also takes into account the relevance of the user-provided script to the video's spoken content. In the proposed method, SD-MVSum, the dependence between each considered pair of data modalities, i.e., script-video and script-transcript, is modeled using a new weighted cross-modal attention mechanism. This mechanism explicitly exploits the semantic similarity between the paired modalities in order to promote the parts of the full-length video that are most relevant to the user-provided script. Furthermore, we extend two large-scale datasets for video summarization (S-VideoXum, MrHiSum) to make them suitable for training and evaluating script-driven multimodal video summarization methods. Experimental comparisons document the competitiveness of our SD-MVSum method against other SOTA approaches for script-driven and generic video summarization. Our new method and extended datasets are available at: https://github.com/IDT-ITI/SD-MVSum.
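To make the core idea concrete, the following is a minimal sketch of a similarity-weighted cross-modal attention step, where script-token embeddings attend over frame (or transcript-segment) embeddings and the scaled dot-product scores are re-weighted by the script–frame cosine similarity. All names, shapes, and the exact weighting scheme here are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def weighted_cross_modal_attention(script, frames):
    """Attend from script tokens (queries) to video frames (keys/values),
    modulating attention scores by pairwise cosine similarity.

    script: (S, d) script-token embeddings
    frames: (F, d) frame (or transcript-segment) embeddings
    Returns (context, attn): attended features (S, d) and weights (S, F).
    NOTE: hypothetical sketch; the published method may differ.
    """
    d = script.shape[-1]
    scores = script @ frames.T / np.sqrt(d)        # (S, F) scaled dot-product
    # cosine similarity between L2-normalized embeddings, in [-1, 1]
    s_n = script / np.linalg.norm(script, axis=-1, keepdims=True)
    f_n = frames / np.linalg.norm(frames, axis=-1, keepdims=True)
    sim = s_n @ f_n.T                              # (S, F)
    # boost scores for semantically similar script-frame pairs
    attn = softmax(scores * (1.0 + sim), axis=-1)
    return attn @ frames, attn
```

Frame-level relevance to the script could then be read off by pooling the attention weights over script tokens (e.g., `attn.mean(axis=0)`) and used to rank shots for selection.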