MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval

📅 2024-10-15
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video retrieval datasets suffer from ambiguous queries, limited scale, monolingual bias, and insufficient multimodal coverage, which hinders precise event-centric cross-modal retrieval in real-world settings. To address this, the paper adopts an event-centric paradigm and introduces a large-scale multilingual news video retrieval benchmark, comprising 218K videos and 3,906 event-oriented queries, that emphasizes fine-grained event-level semantic alignment and joint reasoning over visual, audio, OCR-extracted text, and metadata modalities. The evaluated retrieval approaches integrate multimodal encoding, automatic speech recognition, and cross-lingual alignment into an end-to-end framework. Experiments show that state-of-the-art vision-language models achieve Recall@10 below 12%, confirming the benchmark's substantial difficulty. The work thus provides a rigorous evaluation platform for robust multimodal event retrieval, grounded in event semantics and multimodal synergy.
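The Recall@10 figure above is the standard recall-at-k retrieval metric. As a point of reference only (the function and variable names below are illustrative, not taken from the paper), a minimal sketch of how it is typically computed over a set of queries:

```python
from typing import Dict, List, Set

def recall_at_k(rankings: Dict[str, List[str]],
                relevant: Dict[str, Set[str]],
                k: int = 10) -> float:
    """Fraction of relevant videos found in the top-k results, averaged over queries.

    rankings: query id -> ranked list of video ids returned by the retriever.
    relevant: query id -> set of video ids judged relevant for that query.
    """
    scores = []
    for qid, ranked in rankings.items():
        gold = relevant.get(qid, set())
        if not gold:
            continue  # skip queries with no relevance judgments
        hits = len(set(ranked[:k]) & gold)
        scores.append(hits / len(gold))
    return sum(scores) / len(scores) if scores else 0.0

# Toy usage: one query with two relevant videos, one of which appears in the top 10.
print(recall_at_k({"q1": ["v3", "v7", "v1"]}, {"q1": {"v1", "v9"}}, k=10))  # 0.5
```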

📝 Abstract
Efficiently retrieving and synthesizing information from large-scale multimodal collections has become a critical challenge. However, existing video retrieval datasets suffer from scope limitations, primarily focusing on matching descriptive but vague queries with small collections of professionally edited, English-centric videos. To address this gap, we introduce MultiVENT 2.0, a large-scale, multilingual event-centric video retrieval benchmark featuring a collection of more than 218,000 news videos and 3,906 queries targeting specific world events. These queries specifically target information found in the visual content, audio, embedded text, and text metadata of the videos, requiring systems to leverage all these sources to succeed at the task. Preliminary results show that state-of-the-art vision-language models struggle significantly with this task, and while alternative approaches show promise, they are still insufficient to adequately address this problem. These findings underscore the need for more robust multimodal retrieval systems, as effective video retrieval is a crucial step towards multimodal content understanding and generation.
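Since the abstract stresses that queries target information spread across visual content, audio, embedded text, and metadata, a natural baseline is to score each modality separately and fuse the similarities. The sketch below is purely illustrative and is not the paper's method; the modality names, embeddings, and weights are assumptions made for the example:

```python
import numpy as np

def late_fusion_scores(query_emb: np.ndarray,
                       modality_embs: dict[str, np.ndarray],
                       weights: dict[str, float]) -> np.ndarray:
    """Combine per-modality cosine similarities into one score per candidate video.

    query_emb: (d,) embedding of the query text.
    modality_embs: modality name -> (n_videos, d) matrix of video embeddings
                   (e.g. visual frames, ASR transcript, OCR text, metadata).
    weights: modality name -> fusion weight (assumed, e.g. tuned on a dev set).
    """
    q = query_emb / np.linalg.norm(query_emb)
    n_videos = next(iter(modality_embs.values())).shape[0]
    fused = np.zeros(n_videos)
    for name, embs in modality_embs.items():
        embs_n = embs / np.linalg.norm(embs, axis=1, keepdims=True)
        fused += weights.get(name, 1.0) * (embs_n @ q)  # weighted cosine similarity
    return fused  # higher is better; argsort descending gives the ranking
```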
Problem

Research questions and friction points this paper is trying to address.

Existing datasets match vague descriptive queries against small, English-centric collections of professionally edited videos.
Event-centric retrieval requires synthesizing information across visual content, audio, embedded text, and metadata.
Current systems are not robust at the scale of large multilingual news video collections.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual, event-centric video retrieval benchmark
Queries requiring joint use of visual content, audio, embedded text, and metadata
Large scale: more than 218,000 news videos and 3,906 event-specific queries