Multimodal Lengthy Videos Retrieval Framework and Evaluation Metric

Date: 2025-04-06
AI Summary
To address zero-shot cross-modal retrieval for long videos, this paper proposes a dual-stream matching framework: it uses subtitle-driven unsupervised video segmentation for fine-grained temporal partitioning, jointly matches the visual and auditory modalities, and adds an audio-based two-stage retrieval mechanism to the auditory stream. Key contributions include: (1) a subtitle-guided unsupervised video segmentation strategy; (2) a collaborative audio-visual dual-stream architecture for zero-shot cross-modal retrieval; and (3) a fine-grained evaluation protocol for long videos that quantifies temporal localization accuracy. On the YouCook2 benchmark, the method reports promising retrieval performance on unseen vocabulary and complex scenes.

Abstract
Precise video retrieval requires multi-modal correlations to handle unseen vocabulary and scenes, becoming more complex for lengthy videos where models must perform effectively without prior training on a specific dataset. We introduce a unified framework that combines a visual matching stream and an aural matching stream with a unique subtitles-based video segmentation approach. Additionally, the aural stream includes a complementary audio-based two-stage retrieval mechanism that enhances performance on long-duration videos. Considering the complex nature of retrieval from lengthy videos and its corresponding evaluation, we introduce a new retrieval evaluation method specifically designed for long-video retrieval to support further research. We conducted experiments on the YouCook2 benchmark, showing promising retrieval performance.
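
The abstract does not spell out how subtitles drive the segmentation, so the following is only a minimal sketch of one plausible reading: consecutive subtitle cues are merged into a segment until a long pause or a maximum segment length is reached. The `Cue` type, the function name `segment_by_subtitles`, and the `max_gap` and `max_len` thresholds are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One subtitle cue (hypothetical structure, times in seconds)."""
    start: float
    end: float
    text: str


def _close(group):
    # Collapse a group of cues into one (start, end, text) segment.
    return (group[0].start, group[-1].end, " ".join(c.text for c in group))


def segment_by_subtitles(cues, max_gap=2.0, max_len=30.0):
    """Group time-ordered cues into segments.

    Assumption (not from the paper): a new segment starts when the pause
    before a cue exceeds max_gap, or when adding the cue would stretch the
    current segment beyond max_len seconds.
    """
    segments = []
    current = []
    for cue in cues:
        if current:
            gap = cue.start - current[-1].end
            length = cue.end - current[0].start
            if gap > max_gap or length > max_len:
                segments.append(_close(current))
                current = []
        current.append(cue)
    if current:
        segments.append(_close(current))
    return segments
```

Each returned tuple gives a segment's start time, end time, and concatenated subtitle text, which a retrieval stream could then embed and match against a text query.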
Problem

Research questions and friction points this paper is trying to address.

Multimodal lengthy video retrieval handling unseen vocabulary and scenes
Combining visual and aural streams with subtitle-based segmentation
New evaluation method for long-video retrieval performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines visual and aural matching streams
Uses subtitles-based video segmentation approach
Introduces new long-video retrieval evaluation method
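
To make the combination of the two streams concrete, here is a hedged sketch of one way a two-stage, dual-stream ranking could work: an aural similarity score first shortlists segments, and a weighted sum of visual and aural scores then reranks the shortlist. This is an illustration under stated assumptions, not the paper's method. The function names, the `alpha` weight, the `shortlist` size, and the premise that query and segment embeddings share one vector space are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank_segments(query_vec, visual_vecs, aural_vecs, alpha=0.5, shortlist=2):
    """Return segment indices, best first.

    Stage 1 (assumed): keep the `shortlist` segments with the highest
    aural similarity to the query.
    Stage 2 (assumed): rerank those candidates by
    alpha * visual similarity + (1 - alpha) * aural similarity.
    """
    aural = [cosine(query_vec, v) for v in aural_vecs]
    candidates = sorted(range(len(aural)), key=lambda i: aural[i], reverse=True)
    candidates = candidates[:shortlist]
    fused = {
        i: alpha * cosine(query_vec, visual_vecs[i]) + (1 - alpha) * aural[i]
        for i in candidates
    }
    return sorted(fused, key=fused.get, reverse=True)
```

Shortlisting on one stream keeps the second stage cheap for long videos with many segments; the fused rerank then lets the visual stream correct cases where the audio alone is ambiguous.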
Authors

Mohamed Eltahir
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
Osamah Sarraj
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
Mohammed Bremoo
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
Mohammed Khurd
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
Abdulrahman Alfrihidi
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
Taha Alshatiri
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
Mohammad Almatrafi
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
Tanveer Hussain
Lecturer at Department of Computer Science, Edge Hill University
Computer Vision, Video Summarisation, Saliency Detection, Fire/Smoke Detection