Beyond Closed-Pool Video Retrieval: A Benchmark and Agent Framework for Real-World Video Search and Moment Localization

📅 2026-02-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of existing video retrieval benchmarks, which are confined to closed video pools and precise text descriptions and therefore fail to support open-domain video search based on vague, multi-dimensional human memories. To bridge this gap, the authors introduce RVMS-Bench, a benchmark comprising 1,440 samples across 20 thematic categories and four duration groups, and propose RACLO, an abductive reasoning agent framework. RVMS-Bench uses a hierarchical description schema capturing global impression, key moment, temporal context, and auditory memory, with every sample verified through a human-in-the-loop protocol. RACLO emulates the human cognitive process of “recall–search–verify” by leveraging multimodal large language models for open-domain video retrieval and moment localization. Experiments reveal that current multimodal large language models underperform on this task; the authors expect the benchmark and framework to help advance the robustness of video search in open-domain, unstructured settings.

📝 Abstract

Traditional video retrieval benchmarks focus on matching precise descriptions to closed video pools, failing to reflect real-world searches characterized by fuzzy, multi-dimensional memories on the open web. We present RVMS-Bench, a comprehensive system for evaluating real-world video memory search. It consists of 1,440 samples spanning 20 diverse categories and four duration groups, sourced from real-world open-web videos. RVMS-Bench utilizes a hierarchical description framework encompassing Global Impression, Key Moment, Temporal Context, and Auditory Memory to mimic realistic multi-dimensional search cues, with all samples strictly verified via a human-in-the-loop protocol. We further propose RACLO, an agentic framework that employs abductive reasoning to simulate the human "Recall-Search-Verify" cognitive process, effectively addressing the challenge of searching for videos via fuzzy memories in the real world. Experiments reveal that existing MLLMs still demonstrate insufficient capabilities in real-world Video Retrieval and Moment Localization based on fuzzy memories. We believe this work will facilitate the advancement of video retrieval robustness in real-world unstructured scenarios.
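
To make the four description levels and the "Recall-Search-Verify" idea concrete, here is a minimal Python sketch. It is not the authors' implementation of RVMS-Bench or RACLO: the class names, field names, the `recall_search_verify` function, and the stub `fake_search` and `fake_verify` backends are all hypothetical, and the "recall" step is reduced to concatenating cues rather than any real abductive reasoning with an MLLM.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class MemoryQuery:
    """One fuzzy-memory query; the four fields mirror the levels named in
    the abstract, but the field names themselves are assumptions."""
    global_impression: str
    key_moment: str
    temporal_context: str
    auditory_memory: str

    def cues(self) -> List[str]:
        # Return the non-empty cues in a fixed order.
        fields = (self.global_impression, self.key_moment,
                  self.temporal_context, self.auditory_memory)
        return [c for c in fields if c]


@dataclass(frozen=True)
class Candidate:
    """A retrieved video plus a localized moment, in seconds (hypothetical)."""
    video_id: str
    start_s: float
    end_s: float


def recall_search_verify(
    query: MemoryQuery,
    search: Callable[[str], Iterable[Candidate]],
    verify: Callable[[MemoryQuery, Candidate], bool],
    max_rounds: int = 3,
) -> Optional[Candidate]:
    """Toy loop: build a hypothesis from the cues, search, verify, else revise."""
    cues = query.cues()
    for round_idx in range(min(max_rounds, len(cues))):
        # "Recall": form a search hypothesis from the first round_idx + 1 cues.
        hypothesis = " ".join(cues[: round_idx + 1])
        # "Search": caller-supplied retrieval backend (an assumption here).
        for candidate in search(hypothesis):
            # "Verify": caller-supplied check against the full memory.
            if verify(query, candidate):
                return candidate
    return None


def fake_search(text: str) -> List[Candidate]:
    # Stand-in index: only returns a hit once the query mentions "goal".
    if "goal" in text:
        return [Candidate("vid-001", 12.0, 20.5)]
    return []


def fake_verify(q: MemoryQuery, c: Candidate) -> bool:
    # Stand-in verifier: accepts exactly one known video.
    return c.video_id == "vid-001"


q = MemoryQuery(
    global_impression="rainy football match",
    key_moment="last-minute goal",
    temporal_context="after a corner kick",
    auditory_memory="crowd roaring",
)
result = recall_search_verify(q, fake_search, fake_verify)
```

With these stubs, the first round searches on the global impression alone and finds nothing; the second round adds the key moment, which lets the stub index return a candidate that passes verification. A real system would replace both stubs with an open-web search tool and an MLLM-based verifier.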

Problem

Research questions and friction points this paper is trying to address.

video retrieval

moment localization

fuzzy memory

open-web video

real-world search

Innovation

Methods, ideas, or system contributions that make the work stand out.

real-world video retrieval

moment localization

fuzzy memory