🤖 AI Summary
This work addresses a limitation of existing composed video retrieval methods, which focus solely on visual changes and neglect audio cues, and therefore struggle in scenarios where audio and visual content co-evolve. To bridge this gap, we introduce a novel task, Composed retrieval for Video with its Audio (CoVA), which retrieves target videos that undergo specified transformations in both the visual and auditory modalities relative to a reference video, guided by a textual description. We present AV-Comp, the first benchmark dataset designed for audio-visual compositional retrieval, and propose a new trimodal fusion paradigm featuring an Audio-Visual-Text (AVT) module that dynamically aligns the input text with the most relevant visual or audio modality, enabling selective cross-modal integration. Experimental results demonstrate that our approach substantially outperforms conventional unimodal strategies, establishing a strong baseline for the CoVA task.
📝 Abstract
Composed Video Retrieval (CoVR) aims to retrieve a target video from a large gallery using a reference video and a textual query specifying visual modifications. However, existing benchmarks consider only visual changes, ignoring videos that differ in audio despite being visually similar. To address this limitation, we introduce Composed retrieval for Video with its Audio (CoVA), a new retrieval task that accounts for both visual and auditory variations. To support it, we construct AV-Comp, a benchmark consisting of video pairs with cross-modal changes and corresponding textual queries that describe the differences. We also propose Audio-Visual-Text (AVT) Compositional Fusion, which integrates video, audio, and text features by selectively aligning the query to the most relevant modality. AVT outperforms traditional unimodal fusion and serves as a strong baseline for CoVA. Examples from the proposed dataset, including both visual and auditory information, are available at https://perceptualai-lab.github.io/CoVA/.
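The abstract does not give the AVT module's internals, but the idea of "selectively aligning the query to the most relevant modality" can be illustrated with a minimal sketch. The gating-by-similarity mechanism below is purely hypothetical (the function name `avt_fuse` and the softmax gate are our assumptions, not the paper's architecture): the text embedding softly weights the visual and audio embeddings by its similarity to each before fusion.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D array
    e = np.exp(x - np.max(x))
    return e / e.sum()

def avt_fuse(text, visual, audio):
    """Illustrative trimodal fusion (not the paper's actual AVT module).

    The text query softly selects whichever modality (visual or audio)
    it is most similar to, then the fused representation combines the
    text with the weighted modality embeddings. All inputs are assumed
    to be L2-normalized vectors of the same dimension.
    """
    # affinity of the text query to each modality
    scores = np.array([text @ visual, text @ audio])
    w = softmax(scores)  # selective alignment weights, sum to 1
    fused = text + w[0] * visual + w[1] * audio
    return fused / np.linalg.norm(fused)  # re-normalize for retrieval
```

In a retrieval setting, the fused query vector would then be compared (e.g. by cosine similarity) against fused embeddings of gallery videos; a query describing a visual change should pull the fused representation toward the visual embedding, and an auditory change toward the audio embedding.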