MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual Correspondence

📅 2025-10-24
🤖 AI Summary
This paper introduces Multi-modal Untrimmed Video Retrieval (MUVR), a task that retrieves long untrimmed videos containing relevant segments using multi-modal queries that combine lengthy natural-language descriptions, video tag prompts, and mask prompts. To support the task, the authors present the MUVR benchmark, comprising 53K untrimmed videos, 1,050 multi-modal queries, and 84K query-video matching annotations across domains such as news, travel, and dance. Methodologically, the work contributes a six-level visual correspondence taxonomy (copy, event, scene, instance, action, and others), multi-granularity matching annotations, and a Reranking Score metric, with three benchmark versions (Base, Filter, QA) serving as distinct evaluation protocols. A comprehensive evaluation of 19 state-of-the-art models reveals critical bottlenecks in long-range temporal modeling, cross-modal alignment, multi-video understanding, and reranking. The work establishes a robust benchmark for untrimmed video understanding and multimodal foundation model assessment.

📝 Abstract
We propose the Multi-modal Untrimmed Video Retrieval task, along with a new benchmark (MUVR), to advance video retrieval for long-video platforms. MUVR aims to retrieve untrimmed videos containing relevant segments using multi-modal queries. It has the following features: 1) Practical retrieval paradigm: MUVR supports video-centric multi-modal queries, expressing fine-grained retrieval needs through long text descriptions, video tag prompts, and mask prompts. It adopts a one-to-many retrieval paradigm and focuses on untrimmed videos, tailored for long-video platform applications. 2) Multi-level visual correspondence: To cover common video categories (e.g., news, travel, dance) and precisely define retrieval matching criteria, we construct multi-level visual correspondence based on the core video content (e.g., news events, travel locations, dance moves) that users are interested in and want to retrieve. It covers six levels: copy, event, scene, instance, action, and others. 3) Comprehensive evaluation criteria: We develop three versions of MUVR (i.e., Base, Filter, QA). MUVR-Base/Filter evaluates retrieval models, while MUVR-QA assesses MLLMs in a question-answering format. We also propose a Reranking Score to evaluate the reranking ability of MLLMs. MUVR consists of 53K untrimmed videos from the video platform Bilibili, with 1,050 multi-modal queries and 84K matches. Extensive evaluations of 3 state-of-the-art video retrieval models, 6 image-based VLMs, and 10 MLLMs are conducted. MUVR reveals the limitations of retrieval methods in processing untrimmed videos and multi-modal queries, as well as of MLLMs in multi-video understanding and reranking. Our code and benchmark are available at https://github.com/debby-0527/MUVR.
Problem

Research questions and friction points this paper is trying to address.

Retrieving untrimmed videos using multi-modal queries
Establishing multi-level visual correspondence for precise matching
Evaluating retrieval models and MLLMs on long-video platforms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal queries with text, tags, and mask prompts
Multi-level visual correspondence across six granularity levels
Comprehensive evaluation with three benchmark versions and reranking metrics
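The paper's exact Reranking Score formula is not reproduced on this page, so no attempt is made to implement it here. As a hedged illustration of how one-to-many retrieval settings like MUVR-Base are commonly scored, the sketch below computes mean Average Precision (mAP) over ranked video lists; the function and variable names are hypothetical, not from the paper's code.

```python
def average_precision(ranked_ids, relevant_ids):
    """AP for one query: mean of precision@k taken at each rank
    where a relevant video appears, divided by the number of
    relevant videos (standard retrieval-style AP)."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precisions = []
    for k, vid in enumerate(ranked_ids, start=1):
        if vid in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant)

def mean_average_precision(results):
    """results: list of (ranked_ids, relevant_ids) pairs, one per
    multi-modal query; returns the mean AP across all queries."""
    return sum(average_precision(r, g) for r, g in results) / len(results)
```

For example, a query whose relevant videos land at ranks 1 and 3 of the returned list gets AP = (1/1 + 2/3) / 2 ≈ 0.833; averaging over all queries gives the benchmark-level score. MLLM-based reranking can be evaluated analogously by rescoring a candidate list and recomputing the metric on the new ordering.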
🔎 Similar Papers
Yue Feng
MoE Key Laboratory of Brain-Machine Intelligence Technology, College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics

Jinwei Hu
University of Liverpool
AI Safety and Security · Responsible AI · AI4Science · Explainable AI

Qijia Lu
MoE Key Laboratory of Brain-Machine Intelligence Technology, College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics

Jiawei Niu
MoE Key Laboratory of Brain-Machine Intelligence Technology, College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics

Li Tan
MoE Key Laboratory of Brain-Machine Intelligence Technology, College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics

Shuo Yuan
Beijing University of Posts & Telecommunications
Satellite communication · Edge intelligence · Integrated satellite-terrestrial networks

Ziyi Yan
MoE Key Laboratory of Brain-Machine Intelligence Technology, College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics

Yizhen Jia
MoE Key Laboratory of Brain-Machine Intelligence Technology, College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics

Qingzhi He
MoE Key Laboratory of Brain-Machine Intelligence Technology, College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics

Shiping Ge
Independent Researcher
Multimodal Learning · Data Mining

Ethan Q. Chen
The Hong Kong Polytechnic University

Wentong Li
Nanjing University of Aeronautics and Astronautics
Computer Vision · Machine Learning · Vision-Language Model · Robotics

Limin Wang
Nanjing University

Jie Qin
Professor, Nanjing University of Aeronautics and Astronautics
Computer Vision · Machine Learning · Pattern Recognition