Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval

πŸ“… 2025-04-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing video-text retrieval methods commonly neglect the audio modality or blindly integrate low-quality audio signals, limiting cross-modal representation capability. To address this, we propose a gated attention mechanism that dynamically selects semantically relevant audio cues and filters out uninformative audio, and design an adaptive margin-based contrastive loss that mitigates the ambiguous boundary between positive and negative samples to strengthen video-text alignment. Our approach is the first to incorporate gating mechanisms into joint tri-modal (video-audio-text) modeling, integrating multimodal feature fusion with contrastive learning. Evaluated on all the major public benchmarks, our method achieves state-of-the-art performance, with particularly notable gains in scenarios where audio content is essential for disambiguation.
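The core idea of gated audio fusion can be illustrated with a minimal sketch: video frames attend over audio tokens, and a learned scalar gate scales the resulting audio context before it is added to the visual features, so uninformative audio can be suppressed. This is an illustrative NumPy implementation under assumed shapes and a simple pooled-feature gate, not the paper's exact architecture; the function name and parameters (`Wg`, `bg`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_audio_attention(video, audio, Wg, bg):
    """Fuse audio into video frame features via cross-attention, scaled by
    a learned gate that can suppress uninformative audio.
    video: (T, d) frame features; audio: (S, d) audio features.
    Wg: (2d,) gate weights, bg: scalar gate bias (illustrative gate design)."""
    # Cross-attention: each video frame attends over the audio tokens.
    attn = softmax(video @ audio.T / np.sqrt(video.shape[-1]), axis=-1)
    audio_ctx = attn @ audio  # (T, d) audio context per frame
    # Scalar gate in (0, 1) computed from pooled video and audio features.
    pooled = np.concatenate([video.mean(0), audio.mean(0)])
    gate = 1.0 / (1.0 + np.exp(-(pooled @ Wg + bg)))
    # gate ~ 0 keeps a purely visual representation; gate ~ 1 fully fuses audio.
    return video + gate * audio_ctx, gate
```

With a strongly negative gate bias, the output collapses to the visual features alone, which is the behavior the gating mechanism is meant to enable for videos whose audio carries no useful semantics.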

πŸ“ Abstract
Video-text retrieval, the task of retrieving videos based on a textual query or vice versa, is of paramount importance for video understanding and multimodal information retrieval. Recent methods in this area rely primarily on visual and textual features and often ignore audio, although it helps enhance overall comprehension of video content. Moreover, traditional models that incorporate audio blindly utilize the audio input regardless of whether it is useful or not, resulting in suboptimal video representation. To address these limitations, we propose a novel video-text retrieval framework, Audio-guided VIdeo representation learning with GATEd attention (AVIGATE), that effectively leverages audio cues through a gated attention mechanism that selectively filters out uninformative audio signals. In addition, we propose an adaptive margin-based contrastive loss to deal with the inherently unclear positive-negative relationship between video and text, which facilitates learning better video-text alignment. Our extensive experiments demonstrate that AVIGATE achieves state-of-the-art performance on all the public benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Enhancing video-text retrieval by integrating audio cues selectively
Addressing suboptimal video representation from indiscriminate audio usage
Improving video-text alignment with adaptive contrastive loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gated attention filters uninformative audio signals
Adaptive margin-based contrastive loss improves alignment
AVIGATE framework selectively integrates audio cues into video representations for retrieval
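The adaptive margin idea from the innovations above can be sketched as an InfoNCE-style loss over a video-text similarity matrix, where each negative pair receives a margin that shrinks as that negative becomes more similar to the query, softening the unclear positive/negative boundary. The scaling rule below is an illustrative assumption, not the paper's exact formulation.

```python
import numpy as np

def adaptive_margin_loss(sim, base_margin=0.2):
    """Contrastive loss over a similarity matrix sim (B, B) whose diagonal
    holds the positive video-text pairs. Negatives get an additive margin
    that shrinks for semantically close negatives (illustrative scaling)."""
    pos = np.diag(sim).copy()
    # Margin shrinks toward zero as a negative's similarity approaches 1.
    margins = base_margin * (1.0 - np.clip(sim, 0.0, 1.0))
    logits = sim + margins
    np.fill_diagonal(logits, pos)  # no margin on positive pairs
    # Row-wise log-softmax; loss is the negative log-probability of positives.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Raising the diagonal (i.e., making positives more similar than negatives) lowers the loss, which is the alignment behavior the contrastive objective rewards.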
πŸ”Ž Similar Papers
No similar papers found.