🤖 AI Summary
To address the high cost of manual annotation and the limited scale of existing datasets in video summarization, which severely hinder model generalization, this paper proposes a large-scale training paradigm that leverages dense video captions as weak supervision. Methodologically, it pioneers the use of dense captions instead of human-generated summaries as supervisory signals; incorporates CLIP's vision-language priors to explicitly recover salient objects missing from the captions; and designs a Transformer-based cross-modal generation architecture that combines weakly supervised learning with zero-shot transfer and cross-dataset fine-tuning. Evaluated on two newly constructed benchmarks, TVSum-Caption and SumMe-Caption, the approach substantially outperforms prior methods in both summary quality and cross-domain generalization, establishing a viable pathway toward low-cost, large-scale video summarization.
📝 Abstract
With the rapid growth of video data on the internet, video summarization has become an increasingly important AI technology. However, because annotating video summaries is expensive, existing studies are conducted on small-scale datasets, leading to limited performance and generalization capacity. In this work, we introduce dense video captions as a supervision signal for training video summarization models. Building on this idea, we propose Cap2Sum, a model that learns to summarize videos by generating captions, thereby exploiting dense video caption annotations. This weakly supervised approach allows us to train models on large-scale dense video caption datasets, achieving better performance and generalization capacity. To further improve generalization, we introduce a CLIP Prior mechanism (built on CLIP, a strong vision-language model) that enhances the learning of important objects in the videos that the captions may ignore. In practice, Cap2Sum can perform zero-shot video summarization or be fine-tuned on either the ground-truth summaries or the video captions of the target dataset. To evaluate Cap2Sum after weakly supervised fine-tuning on video captions, we propose two new datasets, TVSum-Caption and SumMe-Caption, which are derived from two common video summarization datasets and will be publicly released. Extensive experiments demonstrate that our method achieves significant improvements in performance and generalization capacity over previous methods.
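The abstract does not spell out how CLIP's vision-language priors can score video frames, so the following is only an illustrative sketch of the general idea: rank frames by the cosine similarity between their CLIP image embeddings and the CLIP text embedding of a caption, then keep the top-scoring frames as a candidate summary. The function names (`cosine_scores`, `select_keyframes`) and the random stand-in embeddings are hypothetical; in real use the embeddings would come from CLIP's image and text encoders, and Cap2Sum's actual mechanism may differ.

```python
import numpy as np

def cosine_scores(frame_embs: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Score each frame by cosine similarity to a caption embedding.

    frame_embs: (num_frames, dim) CLIP image embeddings (assumed precomputed).
    text_emb:   (dim,) CLIP text embedding of a dense caption.
    """
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    return f @ t  # (num_frames,) similarities in [-1, 1]

def select_keyframes(scores: np.ndarray, ratio: float = 0.15) -> np.ndarray:
    """Pick the top-`ratio` fraction of frames as the summary (sorted indices)."""
    k = max(1, int(len(scores) * ratio))
    return np.sort(np.argsort(scores)[-k:])

# Toy example with random stand-in embeddings (real use: CLIP encoders).
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 512))   # 100 frames, 512-dim embeddings
caption = rng.normal(size=512)
scores = cosine_scores(frames, caption)
summary = select_keyframes(scores, ratio=0.1)  # indices of 10 keyframes
```

Such a similarity-based prior is model-free and label-free, which is consistent with the paper's goal of reducing reliance on human-annotated summaries.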