CFSum: A Transformer-Based Multi-Modal Video Summarization Framework With Coarse-Fine Fusion

📅 2025-03-01
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing multimodal video summarization methods suffer from insufficient cross-modal fusion and neglect of the audio modality. To address these limitations, this paper proposes a user-driven tri-modal (video, text, audio) video summarization framework based on a two-stage Transformer architecture: the first stage performs coarse-grained global fusion across all three modalities, while the second stage enables fine-grained audio–visual alignment via text-guided cross-modal attention. This work is the first to jointly model and deeply interconnect all three modalities on equal footing in video summarization, enhancing multimodal synergy under explicit textual guidance. The approach achieves significant improvements over state-of-the-art methods on multiple benchmark datasets, and ablation studies validate both the effectiveness and robustness of the tri-modal joint modeling and the coarse-to-fine hierarchical fusion mechanism.

πŸ“ Abstract
Video summarization, which selects the most informative and/or user-relevant parts of original videos to create concise summary videos, has high research value and consumer demand in today's era of video proliferation. Multi-modal video summarization that accommodates user input has become a research hotspot. However, current multi-modal video summarization methods suffer from two limitations. First, existing methods inadequately fuse information from different modalities and cannot effectively utilize modality-unique features. Second, most multi-modal methods focus on the video and text modalities, neglecting the audio modality, even though audio information can be very useful for certain types of videos. In this paper, we propose CFSum, a transformer-based multi-modal video summarization framework with coarse-fine fusion. CFSum takes video, text, and audio modal features as input and incorporates a two-stage transformer-based feature fusion framework to fully utilize modality-unique information. In the first stage, multi-modal features are fused simultaneously to perform initial coarse-grained feature fusion; in the second stage, video and audio features are explicitly attended with the text representation, yielding more fine-grained information interaction. The CFSum architecture gives equal importance to each modality, ensuring that each modal feature interacts deeply with the other modalities. Extensive comparative experiments against prior methods and ablation studies on various datasets confirm the effectiveness and superiority of CFSum.
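The two-stage coarse-fine fusion described above can be sketched schematically. This is a minimal illustrative sketch, not the authors' implementation: the dimensions, token counts, and the plain (unparameterized) scaled dot-product attention are assumptions, whereas the real CFSum uses learned transformer layers.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: (Lq, d) x (Lk, d) -> (Lq, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

d = 16                                 # shared feature dimension (assumed)
rng = np.random.default_rng(0)
video = rng.normal(size=(8, d))        # 8 video frame features
text  = rng.normal(size=(4, d))        # 4 text (user query) token features
audio = rng.normal(size=(6, d))        # 6 audio segment features

# Stage 1: coarse-grained fusion -- tokens from all three modalities are
# concatenated and jointly self-attended, so every token sees every modality.
tokens = np.concatenate([video, text, audio], axis=0)   # (18, d)
fused = attention(tokens, tokens, tokens)
v1, t1, a1 = fused[:8], fused[8:12], fused[12:]

# Stage 2: fine-grained fusion -- video and audio tokens separately
# cross-attend to the text representation (text-guided attention).
v2 = attention(v1, t1, t1)             # text-conditioned video features
a2 = attention(a1, t1, t1)             # text-conditioned audio features

# A per-frame importance score could then be read off the refined features
# (here a toy projection; the paper would use a learned scoring head).
frame_scores = (v2 @ v2.mean(axis=0)) / d
print(frame_scores.shape)              # (8,) -- one score per video frame
```

The symmetric treatment of `v1` and `a1` in stage 2 mirrors the paper's claim that each modality is given equal importance rather than treating audio as an afterthought.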
Problem

Research questions and friction points this paper is trying to address.

Inadequate fusion of multi-modal information in video summarization.
Neglect of audio modality in existing multi-modal summarization methods.
Need for effective utilization of unique features from video, text, and audio.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based multi-modal video summarization framework
Coarse-fine fusion for multi-modal feature integration
Equal importance to video, text, and audio modalities