Multimodal Lengthy Videos Retrieval Framework and Evaluation Metric

📅 2025-04-06

📈 Citations: 0

✨ Influential: 0

career value

226K/year

🤖 AI Summary

To address zero-shot cross-modal retrieval for long videos, this paper proposes a dual-stream matching framework: it employs subtitle-driven unsupervised video segmentation for fine-grained temporal partitioning, jointly matches visual and auditory modalities, and introduces an audio-enhanced two-stage auditory retrieval mechanism. Key contributions include: (1) the first subtitle-guided unsupervised video segmentation strategy; (2) a novel audio-visual协同 dual-stream architecture for zero-shot cross-modal retrieval; and (3) the first fine-grained evaluation protocol for long videos, enabling quantitative assessment of temporal localization accuracy. On the YouCook2 benchmark, our method achieves significant improvements in retrieval accuracy and demonstrates strong robustness to unseen vocabulary and complex scenes. This work establishes a new paradigm for long-video understanding and cross-modal retrieval.

Technology Category

Application Category

📝 Abstract

Precise video retrieval requires multi-modal correlations to handle unseen vocabulary and scenes, becoming more complex for lengthy videos where models must perform effectively without prior training on a specific dataset. We introduce a unified framework that combines a visual matching stream and an aural matching stream with a unique subtitles-based video segmentation approach. Additionally, the aural stream includes a complementary audio-based two-stage retrieval mechanism that enhances performance on long-duration videos. Considering the complex nature of retrieval from lengthy videos and its corresponding evaluation, we introduce a new retrieval evaluation method specifically designed for long-video retrieval to support further research. We conducted experiments on the YouCook2 benchmark, showing promising retrieval performance.

Problem

Research questions and friction points this paper is trying to address.

Multimodal lengthy video retrieval handling unseen vocabulary and scenes

Combining visual and aural streams with subtitle-based segmentation

New evaluation method for long-video retrieval performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines visual and aural matching streams

Uses subtitles-based video segmentation approach

Introduces new long-video retrieval evaluation method

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs