π€ AI Summary
To address zero-shot cross-modal retrieval for long videos, this paper proposes a dual-stream matching framework: it employs subtitle-driven unsupervised video segmentation for fine-grained temporal partitioning, jointly matches visual and auditory modalities, and introduces an audio-enhanced two-stage auditory retrieval mechanism. Key contributions include: (1) the first subtitle-guided unsupervised video segmentation strategy; (2) a novel audio-visualεε dual-stream architecture for zero-shot cross-modal retrieval; and (3) the first fine-grained evaluation protocol for long videos, enabling quantitative assessment of temporal localization accuracy. On the YouCook2 benchmark, our method achieves significant improvements in retrieval accuracy and demonstrates strong robustness to unseen vocabulary and complex scenes. This work establishes a new paradigm for long-video understanding and cross-modal retrieval.
π Abstract
Precise video retrieval requires multi-modal correlations to handle unseen vocabulary and scenes, becoming more complex for lengthy videos where models must perform effectively without prior training on a specific dataset. We introduce a unified framework that combines a visual matching stream and an aural matching stream with a unique subtitles-based video segmentation approach. Additionally, the aural stream includes a complementary audio-based two-stage retrieval mechanism that enhances performance on long-duration videos. Considering the complex nature of retrieval from lengthy videos and its corresponding evaluation, we introduce a new retrieval evaluation method specifically designed for long-video retrieval to support further research. We conducted experiments on the YouCook2 benchmark, showing promising retrieval performance.