TripleSumm: Adaptive Triple-Modality Fusion for Video Summarization

📅 2026-03-01
🤖 AI Summary
Existing video summarization methods struggle to adapt to dynamic shifts in multimodal salience within videos because they rely on static or modality-agnostic fusion strategies. To address this limitation, this work proposes TripleSumm, the first architecture to achieve frame-level adaptive fusion of the visual, textual, and audio modalities. It also introduces MoSu, the first large-scale multimodal video summarization benchmark dataset. By integrating adaptive modality weighting, multimodal feature alignment, and deep learning-based modeling, TripleSumm substantially outperforms existing approaches across four benchmarks, including MoSu, establishing a new state of the art in multimodal video summarization.

📝 Abstract
The exponential growth of video content necessitates effective video summarization to efficiently extract key information from long videos. However, current approaches struggle to fully comprehend complex videos, primarily because they employ static or modality-agnostic fusion strategies. These methods fail to account for the dynamic, frame-dependent variations in modality saliency inherent in video data. To overcome these limitations, we propose TripleSumm, a novel architecture that adaptively weights and fuses the contributions of the visual, textual, and audio modalities at the frame level. Furthermore, a significant bottleneck for research into multimodal video summarization has been the lack of comprehensive benchmarks. Addressing this bottleneck, we introduce MoSu (Most Replayed Multimodal Video Summarization), the first large-scale benchmark that provides all three modalities. Extensive experiments demonstrate that TripleSumm achieves state-of-the-art performance, outperforming existing methods by a significant margin on four benchmarks, including MoSu. Our code and dataset are available at https://github.com/smkim37/TripleSumm.
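The core idea of frame-level adaptive fusion can be illustrated with a minimal sketch. This is not the authors' actual TripleSumm implementation (the paper's architecture is not detailed on this page); it assumes a simple linear gate per modality whose scores are passed through a softmax to produce frame-dependent modality weights, which then mix the per-frame feature vectors. All function and parameter names here (`adaptive_fuse`, `gate_weights`) are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scalars."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def adaptive_fuse(visual, text, audio, gate_weights):
    """Fuse three per-frame feature streams with frame-level adaptive weights.

    visual, text, audio: lists of T feature vectors, all of dimension D.
    gate_weights: one D-dimensional gate vector per modality (a stand-in
        for a learned scoring function; hypothetical, for illustration).
    Returns (fused, weights): T fused D-dim vectors and the per-frame
    (w_visual, w_text, w_audio) modality weights.
    """
    fused, weights = [], []
    for f_v, f_t, f_a in zip(visual, text, audio):
        # One scalar salience score per modality via a linear gate (dot product).
        scores = [
            sum(w * x for w, x in zip(gate_weights[0], f_v)),
            sum(w * x for w, x in zip(gate_weights[1], f_t)),
            sum(w * x for w, x in zip(gate_weights[2], f_a)),
        ]
        # Softmax turns the scores into frame-dependent convex weights.
        w_v, w_t, w_a = softmax(scores)
        fused.append([w_v * v + w_t * t + w_a * a
                      for v, t, a in zip(f_v, f_t, f_a)])
        weights.append((w_v, w_t, w_a))
    return fused, weights
```

Because the gate scores are recomputed from each frame's features, the modality weights shift from frame to frame, which is the behavior (adapting to frame-dependent modality saliency) that a static fusion scheme cannot express.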
Problem

Research questions and friction points this paper is trying to address.

video summarization
multimodal fusion
modality saliency
benchmark dataset
frame-level adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive fusion
multimodal video summarization
frame-level weighting
triple-modality
benchmark dataset
👥 Authors
Sumin Kim, Seoul National University
Hyemin Jeong, Seoul National University
Mingu Kang, Seoul National University
Yejin Kim, Seoul National University
Yoori Oh, Seoul National University
Joonseok Lee, Google Research, Seoul National University