🤖 AI Summary
This paper introduces Multi-modal Untrimmed Video Retrieval (MUVR), a task that targets retrieving long untrimmed videos containing relevant segments using video-centric multi-modal queries, including lengthy natural language descriptions, video tag prompts, and mask prompts. To support this task, the authors present MUVR, a large-scale benchmark comprising 53K untrimmed videos, 1,050 multi-modal queries, and 84K query-video matches across domains such as news, travel, and dance. Methodologically, the benchmark defines a six-level visual correspondence taxonomy (copy, event, scene, instance, action, others), three benchmark versions (Base, Filter, QA), and a Reranking Score for assessing the reranking ability of MLLMs. Comprehensive evaluation of 19 state-of-the-art models reveals critical bottlenecks in long-range temporal modeling and cross-modal alignment. The work establishes a new retrieval paradigm and a robust benchmark for untrimmed video understanding and multimodal foundation model assessment.
📝 Abstract
We propose the Multi-modal Untrimmed Video Retrieval task, along with a new benchmark (MUVR), to advance video retrieval for long-video platforms. MUVR aims to retrieve untrimmed videos containing relevant segments using multi-modal queries. It has the following features: 1) Practical retrieval paradigm: MUVR supports video-centric multi-modal queries, expressing fine-grained retrieval needs through long text descriptions, video tag prompts, and mask prompts. It adopts a one-to-many retrieval paradigm and focuses on untrimmed videos, tailored for long-video platform applications. 2) Multi-level visual correspondence: To cover common video categories (e.g., news, travel, dance) and precisely define retrieval matching criteria, we construct multi-level visual correspondence based on the core video content (e.g., news events, travel locations, dance moves) that users are interested in and want to retrieve. It covers six levels: copy, event, scene, instance, action, and others. 3) Comprehensive evaluation criteria: We develop 3 versions of MUVR (i.e., Base, Filter, QA). MUVR-Base/Filter evaluates retrieval models, while MUVR-QA assesses MLLMs in a question-answering format. We also propose a Reranking Score to evaluate the reranking ability of MLLMs. MUVR consists of 53K untrimmed videos from the video platform Bilibili, with 1,050 multi-modal queries and 84K matches. We conduct extensive evaluations of 3 state-of-the-art video retrieval models, 6 image-based VLMs, and 10 MLLMs. MUVR reveals the limitations of retrieval methods in processing untrimmed videos and multi-modal queries, as well as the limitations of MLLMs in multi-video understanding and reranking. Our code and benchmark are available at https://github.com/debby-0527/MUVR.
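To make the one-to-many retrieval paradigm concrete: each query is matched against many untrimmed videos, several of which may be relevant, so a ranked list is typically scored with a set-based metric such as mean Average Precision. The sketch below is illustrative only (the function names and the choice of mAP are assumptions, not the paper's exact protocol or Reranking Score):

```python
# Illustrative sketch of one-to-many retrieval evaluation (assumed
# metric: mean Average Precision; not MUVR's exact protocol).

def average_precision(ranked_ids, relevant_ids):
    """AP for one query: average of precision@k at each relevant hit,
    normalized by the total number of relevant videos."""
    relevant = set(relevant_ids)
    hits, precisions = 0, []
    for k, vid in enumerate(ranked_ids, start=1):
        if vid in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, ground_truth):
    """mAP over all queries; both dicts are keyed by query id."""
    aps = [average_precision(rankings[q], ground_truth[q])
           for q in ground_truth]
    return sum(aps) / len(aps)

# Hypothetical example: query "q1" has two relevant videos.
rankings = {"q1": ["v3", "v1", "v7", "v2"]}
ground_truth = {"q1": {"v1", "v2"}}
print(mean_average_precision(rankings, ground_truth))  # (1/2 + 2/4)/2 = 0.5
```

A metric of this shape rewards models that rank all relevant untrimmed videos highly, which is the behavior MUVR's one-to-many paradigm is designed to stress-test.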