🤖 AI Summary
This work addresses a limitation of existing composed video retrieval methods, which focus solely on visual changes and neglect audio cues, and therefore struggle in scenarios where audio and visual content co-evolve. To bridge this gap, we introduce a novel task, Composed retrieval for Video with its Audio (CoVA), which retrieves target videos that undergo specified transformations in both the visual and auditory modalities relative to a reference video, guided by a textual description. We present AV-Comp, the first benchmark dataset designed for audio-visual compositional retrieval, and propose a new trimodal fusion paradigm featuring an Audio-Visual-Text (AVT) module that dynamically aligns the input text with the most relevant visual or audio modality, enabling selective cross-modal integration. Experimental results demonstrate that our approach substantially outperforms conventional unimodal strategies, establishing a strong baseline for the CoVA task.
📝 Abstract
Composed Video Retrieval (CoVR) aims to retrieve a target video from a large gallery using a reference video and a textual query specifying visual modifications. However, existing benchmarks consider only visual changes, ignoring videos that differ in audio despite being visually similar. To address this limitation, we introduce Composed retrieval for Video with its Audio (CoVA), a new retrieval task that accounts for both visual and auditory variations. To support it, we construct AV-Comp, a benchmark consisting of video pairs with cross-modal changes and corresponding textual queries that describe the differences. We also propose Audio-Visual-Text (AVT) Compositional Fusion, which integrates video, audio, and text features by selectively aligning the query to the most relevant modality. AVT outperforms traditional unimodal fusion and serves as a strong baseline for CoVA. Examples from the proposed dataset, including both visual and auditory information, are available at https://perceptualai-lab.github.io/CoVA/.
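The abstract does not give the AVT module's internals, but the idea of "selectively aligning the query to the most relevant modality" can be illustrated with a minimal sketch. The gating-by-similarity mechanism below is purely hypothetical (the function name `avt_fuse` and the softmax gate are our assumptions, not the paper's architecture): the text embedding softly weights the visual and audio embeddings by its similarity to each before fusion.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D array
    e = np.exp(x - np.max(x))
    return e / e.sum()

def avt_fuse(text, visual, audio):
    """Illustrative trimodal fusion (not the paper's actual AVT module).

    The text query softly selects whichever modality (visual or audio)
    it is most similar to, then the fused representation combines the
    text with the weighted modality embeddings. All inputs are assumed
    to be L2-normalized vectors of the same dimension.
    """
    # affinity of the text query to each modality
    scores = np.array([text @ visual, text @ audio])
    w = softmax(scores)  # selective alignment weights, sum to 1
    fused = text + w[0] * visual + w[1] * audio
    return fused / np.linalg.norm(fused)  # re-normalize for retrieval
```

In a retrieval setting, the fused query vector would then be compared (e.g. by cosine similarity) against fused embeddings of gallery videos; a query describing a visual change should pull the fused representation toward the visual embedding, and an auditory change toward the audio embedding.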