🤖 AI Summary
Existing video retrieval models over-rely on the visual modality while neglecting complementary semantic signals from text, speech, and audio, leading to cross-modal semantic misalignment and inadequate multilingual support. This paper proposes an information-need-driven multimodal video retrieval framework that breaks the vision-centric paradigm by jointly modeling visual, audio, subtitle text, and speech modalities, and natively supports cross-lingual queries. Key contributions include: (1) a modality-aware weighted reciprocal rank fusion (RRF) mechanism for dynamic, context-sensitive weighting of multimodal evidence; and (2) a modular, extensible multilingual multimodal retrieval architecture. Leveraging ViT, Whisper, and CLIP for multimodal feature extraction and alignment, the system achieves an 81% improvement in nDCG@20 over the best prior multimodal baseline and a 37% gain over unimodal approaches on MultiVENT 2.0 and TVR, demonstrating significantly enhanced fine-grained semantic matching capability.
📝 Abstract
Videos inherently contain multiple modalities, including visual events, text overlays, sounds, and speech, all of which are important for retrieval. However, state-of-the-art multimodal language models like VAST and LanguageBind are built on vision-language models (VLMs) and thus overly prioritize visual signals. Retrieval benchmarks further reinforce this bias by focusing on visual queries and neglecting other modalities. We present MMMORRF, a search system that extracts text and features from both visual and audio modalities and integrates them with a novel modality-aware weighted reciprocal rank fusion. MMMORRF is both effective and efficient, demonstrating practicality in searching videos based on users' information needs instead of visual descriptive queries. We evaluate MMMORRF on MultiVENT 2.0 and TVR, two multimodal benchmarks designed for more targeted information needs, and find that it improves nDCG@20 by 81% over leading multimodal encoders and 37% over single-modality retrieval, demonstrating the value of integrating diverse modalities.
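The weighted reciprocal rank fusion described above can be sketched in a few lines. This is a minimal illustration of the general technique, not the paper's implementation: the modality names, the per-modality weights, and the smoothing constant `k = 60` (a conventional RRF default) are assumptions for the example.

```python
def weighted_rrf(rankings, weights, k=60):
    """Fuse per-modality ranked lists with weighted reciprocal rank fusion.

    rankings: dict mapping modality name -> ordered list of doc ids (best first)
    weights:  dict mapping modality name -> fusion weight (illustrative values)
    k:        RRF smoothing constant; 60 is the conventional default
    """
    scores = {}
    for modality, ranked_docs in rankings.items():
        w = weights.get(modality, 1.0)
        for rank, doc in enumerate(ranked_docs, start=1):
            # Each modality contributes w / (k + rank) to the document's score
            scores[doc] = scores.get(doc, 0.0) + w / (k + rank)
    # Return doc ids sorted by fused score, highest first
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical per-modality rankings for three videos "a", "b", "c"
rankings = {
    "visual": ["a", "b", "c"],
    "speech": ["b", "a", "c"],
    "ocr":    ["c", "b", "a"],
}
# Modality-aware weighting: here speech is up-weighted (values are made up)
weights = {"visual": 1.0, "speech": 2.0, "ocr": 0.5}
fused = weighted_rrf(rankings, weights)
```

Rank-based fusion like this needs no score calibration across modalities, which is what makes it practical for combining heterogeneous retrievers (e.g., a CLIP-style visual index and a text index over Whisper transcripts).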