SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval

📅 2026-03-09
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the common oversight of speech information in existing video–text retrieval methods and the insufficient modeling of audio within multimodal fusion strategies. To this end, we propose SAVE, a novel approach that extends the CLIP architecture with a dedicated speech encoding branch and introduces a soft-ALBEF mechanism to enable early alignment and fusion of visual and audio modalities, thereby constructing a speech-aware multimodal video representation. Extensive experiments demonstrate that SAVE significantly outperforms current state-of-the-art methods across five benchmark datasets, achieving relative improvements in SumR of +4.1% on MSRVTT-9k, +1.9% on MSRVTT-7k, +2.5% on VATEX, +9.8% on Charades, and +2.1% on LSMDC.

πŸ“ Abstract
For video-text retrieval, CLIP has become the de facto choice. Since CLIP provides only image and text encoders, this consensus has led to a biased paradigm that entirely ignores the audio track of videos. While several attempts have been made to reintroduce audio -- typically by incorporating an audio encoder and fusing its output with visual features -- these methods face two challenges: ineffective representation of speech content and suboptimal vision-audio fusion. To address these issues jointly, we propose SAVE, a Speech Aware Video rEpresentation learning method. SAVE improves upon AVIGATE, a state-of-the-art audiovisual method, with a dedicated speech branch for more effective speech embedding. Furthermore, we introduce soft-ALBEF for early vision-audio alignment that facilitates fusion. Extensive experiments on five benchmarks show that SAVE compares favorably against the state of the art, outperforming AVIGATE in SumR by +4.1% on MSRVTT-9k, +1.9% on MSRVTT-7k, +2.5% on VATEX, +9.8% on Charades, and +2.1% on LSMDC.
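The abstract does not spell out how soft-ALBEF works. As a rough illustration only: ALBEF-style cross-modal contrastive alignment typically blends hard one-hot targets for matched pairs with soft targets distilled from a momentum/teacher encoder. The sketch below follows that recipe for vision-audio pairs; the function names, the temperature `tau`, the blending weight `alpha`, and the teacher-target construction are all assumptions for illustration, not the authors' actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_contrastive_loss(vis, aud, teacher_vis, teacher_aud,
                          tau=0.07, alpha=0.4):
    """ALBEF-style vision-audio contrastive loss with softened targets.

    Hard one-hot targets (matched pairs on the diagonal) are blended with
    soft targets computed from teacher/momentum embeddings, weighted by
    `alpha`. All hyperparameters here are illustrative assumptions.
    """
    def l2norm(z):
        return z / np.linalg.norm(z, axis=1, keepdims=True)

    vis, aud = l2norm(vis), l2norm(aud)
    logits = vis @ aud.T / tau  # student vision-to-audio similarities
    # Soft targets: the teacher's own similarity distribution.
    soft = softmax(l2norm(teacher_vis) @ l2norm(teacher_aud).T / tau, axis=1)
    targets = (1.0 - alpha) * np.eye(len(vis)) + alpha * soft
    log_probs = np.log(softmax(logits, axis=1) + 1e-12)
    # Cross-entropy between blended targets and student predictions.
    return float(-(targets * log_probs).sum(axis=1).mean())
```

In practice such losses are usually symmetric (an audio-to-vision term is averaged in as well), and the teacher is an exponential-moving-average copy of the student; both details are omitted here for brevity.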
Problem

Research questions and friction points this paper is trying to address.

video-text retrieval
speech representation
audiovisual fusion
CLIP
multimodal learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

speech-aware representation
video-text retrieval
audiovisual fusion
soft-ALBEF
multimodal learning