🤖 AI Summary
To address the high cost of manual annotation and the limited scale of existing datasets in video summarization, which severely hinder model generalization, this paper proposes a large-scale training paradigm that leverages dense video captions as weak supervision. Methodologically, it pioneers the use of dense captions instead of human-generated summaries as supervisory signals; incorporates CLIP's vision-language priors to explicitly recover salient objects missing from the captions; and designs a Transformer-based cross-modal generation architecture that combines weakly supervised learning with zero-shot transfer and cross-dataset fine-tuning. Evaluated on two newly constructed benchmarks, TVSum-Caption and SumMe-Caption, the approach substantially outperforms prior methods in both summary quality and cross-domain generalization, establishing a viable pathway toward low-cost, large-scale video summarization.
📝 Abstract
With the rapid growth of video data on the internet, video summarization has become an increasingly important AI technology. However, because annotating video summaries is expensive, existing studies are conducted on small-scale datasets, leading to limited performance and generalization capacity. In this work, we introduce dense video captions as a supervision signal for training video summarization models. Building on this idea, we propose Cap2Sum, a model that learns to summarize videos by generating captions, thereby exploiting dense video caption annotations. This weakly supervised approach allows us to train models on large-scale dense video caption datasets, achieving better performance and generalization capacity. To further improve generalization, we introduce a CLIP Prior mechanism (built on CLIP, a strong vision-language model) that enhances the learning of important objects in the videos that the captions may ignore. In practice, Cap2Sum can perform zero-shot video summarization or be fine-tuned on either the ground-truth summaries or the video captions of the target dataset. To evaluate Cap2Sum after weakly supervised fine-tuning on video captions, we propose two new datasets, TVSum-Caption and SumMe-Caption, which are derived from two common video summarization datasets and will be publicly released. Extensive experiments demonstrate that our method achieves significant improvements in performance and generalization capacity over previous methods.
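The abstract does not spell out how CLIP's vision-language priors can score video frames, so the following is only an illustrative sketch of the general idea: rank frames by the cosine similarity between their CLIP image embeddings and the CLIP text embedding of a caption, then keep the top-scoring frames as a candidate summary. The function names (`cosine_scores`, `select_keyframes`) and the random stand-in embeddings are hypothetical; in real use the embeddings would come from CLIP's image and text encoders, and Cap2Sum's actual mechanism may differ.

```python
import numpy as np

def cosine_scores(frame_embs: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Score each frame by cosine similarity to a caption embedding.

    frame_embs: (num_frames, dim) CLIP image embeddings (assumed precomputed).
    text_emb:   (dim,) CLIP text embedding of a dense caption.
    """
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    return f @ t  # (num_frames,) similarities in [-1, 1]

def select_keyframes(scores: np.ndarray, ratio: float = 0.15) -> np.ndarray:
    """Pick the top-`ratio` fraction of frames as the summary (sorted indices)."""
    k = max(1, int(len(scores) * ratio))
    return np.sort(np.argsort(scores)[-k:])

# Toy example with random stand-in embeddings (real use: CLIP encoders).
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 512))   # 100 frames, 512-dim embeddings
caption = rng.normal(size=512)
scores = cosine_scores(frames, caption)
summary = select_keyframes(scores, ratio=0.1)  # indices of 10 keyframes
```

Such a similarity-based prior is model-free and label-free, which is consistent with the paper's goal of reducing reliance on human-annotated summaries.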