🤖 AI Summary
Cross-modal semantic alignment remains challenging in multilingual, complex video retrieval. Method: This work extends the OmniEmbed model (originally built on the Tevatron 2.0 framework) to the video modality, introducing an end-to-end fine-tuning strategy that jointly encodes text, keyframe images, raw audio waveforms, and video clips into a unified four-modal embedding space. The model is trained on the MultiVENT 2.0 multimodal dataset, and all weights are publicly released. Contribution/Results: Our approach achieves first place among public submissions on the MAGMaR shared task leaderboard (as of May 20, 2025), substantially improving Recall@10 and mean Average Precision (mAP) for multilingual video retrieval. These results empirically validate the effectiveness and generalizability of the unified multimodal embedding paradigm in realistic, multilingual video search scenarios.
📝 Abstract
Effective video retrieval remains challenging due to the complexity of integrating visual, auditory, and textual modalities. In this paper, we explore unified retrieval methods using OmniEmbed, a powerful multimodal embedding model from the Tevatron 2.0 toolkit, in the context of the MAGMaR shared task. Evaluated on the comprehensive MultiVENT 2.0 dataset, OmniEmbed generates unified embeddings for text, images, audio, and video, enabling robust multimodal retrieval. By fine-tuning OmniEmbed on the combined multimodal data (visual frames, audio tracks, and textual descriptions) provided in MultiVENT 2.0, we achieve substantial improvements on complex, multilingual video retrieval tasks. Our submission achieved the highest score on the MAGMaR shared task leaderboard among public submissions as of May 20, 2025, highlighting the practical effectiveness of our unified multimodal retrieval approach. The model checkpoint from this work is open-sourced.
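To make the retrieval setup concrete, the following is a minimal, hypothetical sketch of how retrieval in a unified embedding space works: an OmniEmbed-style encoder maps both text queries and fused video documents (frames, audio, and text) into one vector space, and candidates are ranked by cosine similarity. The embeddings below are random stand-ins (the encoder itself is not reproduced here), and the Recall@10 computation is for a single illustrative query, not the shared-task metric pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # embedding dimension (illustrative, not the model's actual size)
n_docs = 100   # number of indexed videos

def normalize(x):
    """L2-normalize rows so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-in unified video embeddings; in practice these would come from
# encoding frames + audio + text descriptions with the fine-tuned model.
doc_emb = normalize(rng.standard_normal((n_docs, d)))

# A query embedding close to document 42, simulating a relevant match.
query_emb = normalize(doc_emb[42] + 0.1 * rng.standard_normal(d))

# Rank all documents by cosine similarity and keep the top 10.
scores = doc_emb @ query_emb
top10 = np.argsort(-scores)[:10]

# Recall@10 for this single query: is the relevant video retrieved?
recall_at_10 = int(42 in top10)
print(recall_at_10)
```

Because all modalities share one space, the same index serves text, image, audio, or video queries; only the encoder input changes, not the ranking logic.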