🤖 AI Summary
Existing video retrieval models over-rely on the visual modality while neglecting complementary semantic signals from text, speech, and audio, leading to cross-modal semantic misalignment and inadequate multilingual support. This paper proposes an information-need-driven multimodal video retrieval framework that breaks the vision-centric paradigm by jointly modeling visual, audio, subtitle text, and speech modalities, and natively supports cross-lingual queries. Key contributions include: (1) a modality-aware weighted reciprocal rank fusion (RRF) mechanism for dynamic, context-sensitive weighting of multimodal evidence; and (2) a modular, extensible multilingual multimodal retrieval architecture. Leveraging ViT, Whisper, and CLIP for multimodal feature extraction and alignment, the system achieves an 81% improvement in nDCG@20 over the best prior multimodal baseline and a 37% gain over unimodal approaches on MultiVENT 2.0 and TVR, demonstrating significantly enhanced fine-grained semantic matching capability.
📝 Abstract
Videos inherently contain multiple modalities, including visual events, text overlays, sounds, and speech, all of which are important for retrieval. However, state-of-the-art multimodal language models like VAST and LanguageBind are built on vision-language models (VLMs) and thus overly prioritize visual signals. Retrieval benchmarks further reinforce this bias by focusing on visual queries and neglecting other modalities. We present MMMORRF, a search system that extracts text and features from both visual and audio modalities and integrates them with a novel modality-aware weighted reciprocal rank fusion. MMMORRF is both effective and efficient, demonstrating practicality in searching videos based on users' information needs instead of visual descriptive queries. We evaluate MMMORRF on MultiVENT 2.0 and TVR, two multimodal benchmarks designed for more targeted information needs, and find that it improves nDCG@20 by 81% over leading multimodal encoders and 37% over single-modality retrieval, demonstrating the value of integrating diverse modalities.
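The weighted reciprocal rank fusion described above can be sketched in a few lines. This is a minimal illustration of the general technique, not the paper's implementation: the modality names, the per-modality weights, and the smoothing constant `k = 60` (a conventional RRF default) are assumptions for the example.

```python
def weighted_rrf(rankings, weights, k=60):
    """Fuse per-modality ranked lists with weighted reciprocal rank fusion.

    rankings: dict mapping modality name -> ordered list of doc ids (best first)
    weights:  dict mapping modality name -> fusion weight (illustrative values)
    k:        RRF smoothing constant; 60 is the conventional default
    """
    scores = {}
    for modality, ranked_docs in rankings.items():
        w = weights.get(modality, 1.0)
        for rank, doc in enumerate(ranked_docs, start=1):
            # Each modality contributes w / (k + rank) to the document's score
            scores[doc] = scores.get(doc, 0.0) + w / (k + rank)
    # Return doc ids sorted by fused score, highest first
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical per-modality rankings for three videos "a", "b", "c"
rankings = {
    "visual": ["a", "b", "c"],
    "speech": ["b", "a", "c"],
    "ocr":    ["c", "b", "a"],
}
# Modality-aware weighting: here speech is up-weighted (values are made up)
weights = {"visual": 1.0, "speech": 2.0, "ocr": 0.5}
fused = weighted_rrf(rankings, weights)
```

Rank-based fusion like this needs no score calibration across modalities, which is what makes it practical for combining heterogeneous retrievers (e.g., a CLIP-style visual index and a text index over Whisper transcripts).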