Unsupervised Video Highlight Detection by Learning from Audio and Visual Recurrence

📅 2024-07-18
🏛️ IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of unsupervised highlight detection amid the explosive growth of video content, this paper proposes a fully unsupervised audio-visual co-learning method that requires no human annotations. It leverages the cross-sample recurrence of audio-visual features across semantically similar videos to automatically generate pseudo-labels. The key contribution is the first modeling of cross-video audio-visual recurrence in an unsupervised setting, coupled with a clustering-based pseudo-category guidance mechanism that produces reliable audio-visual pseudo-highlights. The method comprises multimodal feature extraction, cross-video audio similarity modeling, K-means–driven pseudo-label generation, and a self-supervised highlight detection network. Extensive experiments demonstrate significant improvements over state-of-the-art unsupervised approaches on three benchmarks. Ablation studies confirm the critical performance gain from the audio modality and validate the effectiveness of joint audio-visual recurrence modeling.
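The pseudo-label pipeline begins by grouping videos into pseudo-categories via clustering. As a minimal sketch of that step (assuming pre-extracted video-level features; this is plain k-means with deterministic farthest-point initialization, not the authors' exact clustering code):

```python
import numpy as np

def kmeans_pseudo_categories(feats, k, iters=20):
    """Assign each video a pseudo-category label via k-means.

    feats : (N, D) array of video-level features (hypothetical input).
    k     : number of pseudo-categories.
    """
    feats = np.asarray(feats, dtype=float)
    # Farthest-point initialization: deterministic and well spread out.
    centers = [feats[0]]
    for _ in range(k - 1):
        d = np.min([((feats - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(feats[d.argmax()])
    centers = np.array(centers)
    # Standard Lloyd iterations: assign, then recompute centroids.
    for _ in range(iters):
        d = ((feats[:, None] - centers[None]) ** 2).sum(-1)  # (N, k)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = feats[labels == j].mean(0)
    return labels
```

In practice an off-the-shelf implementation such as `sklearn.cluster.KMeans` would serve the same purpose; the sketch only illustrates how videos are partitioned before pseudo-highlight scoring.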

πŸ“ Abstract
With the exponential growth of video content, the need for automated video highlight detection to extract key moments or highlights from lengthy videos has become increasingly pressing. This technology has the potential to enhance user experiences by allowing quick access to relevant content across diverse domains. Existing methods typically rely either on expensive manually labeled frame-level annotations, or on a large external dataset of videos for weak supervision through category information. To overcome these limitations, we focus on unsupervised video highlight detection, eliminating the need for manual annotations. We propose a novel unsupervised approach which capitalizes on the premise that significant moments tend to recur across multiple videos of a similar category in both audio and visual modalities. Surprisingly, audio remains under-explored, especially in unsupervised algorithms, despite its potential to detect key moments. Through a clustering technique, we identify pseudo-categories of videos and compute audio pseudo-highlight scores for each video by measuring the similarities of audio features among audio clips of all the videos within each pseudo-category. Similarly, we also compute visual pseudo-highlight scores for each video using visual features. Then, we combine audio and visual pseudo-highlights to create the audio-visual pseudo ground-truth highlight of each video for training an audio-visual highlight detection network. Extensive experiments and ablation studies on three benchmarks showcase the superior performance of our method over prior work.
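The scoring and fusion steps described in the abstract can be sketched as follows. Each clip is scored by its mean cosine similarity to clips from other videos in the same pseudo-category, and the two modalities are then fused. Function names, min-max normalization, and equal weighting are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def pseudo_highlight_scores(clip_feats, video_ids, pseudo_cats):
    """Score each clip by its mean cosine similarity to clips of *other*
    videos sharing its pseudo-category (cross-video recurrence).

    clip_feats  : (N, D) per-clip features (audio or visual).
    video_ids   : (N,) index of the video each clip belongs to.
    pseudo_cats : (N,) pseudo-category label of each clip's video.
    """
    feats = clip_feats / np.linalg.norm(clip_feats, axis=1, keepdims=True)
    scores = np.zeros(len(feats))
    for i in range(len(feats)):
        # Compare only against clips from other videos in the same pseudo-category.
        mask = (pseudo_cats == pseudo_cats[i]) & (video_ids != video_ids[i])
        if mask.any():
            scores[i] = (feats[mask] @ feats[i]).mean()
    return scores

def combine_pseudo_highlights(audio_scores, visual_scores, alpha=0.5):
    """Fuse per-modality scores into an audio-visual pseudo ground-truth."""
    def minmax(s):
        rng = s.max() - s.min()
        return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)
    return alpha * minmax(audio_scores) + (1 - alpha) * minmax(visual_scores)
```

Running `pseudo_highlight_scores` once on audio features and once on visual features, then fusing the results, yields the per-clip pseudo ground-truth used to train the highlight detection network.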
Problem

Research questions and friction points this paper is trying to address.

How to detect video highlights without expensive frame-level manual annotations
How to exploit the recurrence of key moments across similar videos in both audio and visual modalities
How to combine audio and visual cues for fully unsupervised highlight detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fully unsupervised highlight detection requiring neither manual annotations nor external category labels
Clustering-based pseudo-categories combined with cross-video audio and visual recurrence modeling
Fusion of audio and visual pseudo-highlights into a pseudo ground-truth for training the detection network