Beyond Closed-Pool Video Retrieval: A Benchmark and Agent Framework for Real-World Video Search and Moment Localization

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing video retrieval benchmarks: they match precise descriptions against closed video pools, so they do not reflect open-web video search driven by vague, multi-dimensional human memories. To bridge this gap, the authors introduce RVMS-Bench, a benchmark of 1,440 samples spanning 20 categories and four duration groups, and propose RACLO, an agent framework based on abductive reasoning. RVMS-Bench uses a hierarchical description schema covering global impression, key moment, temporal context, and auditory memory, with every sample verified through a human-in-the-loop protocol. RACLO emulates the human "Recall-Search-Verify" cognitive process, using multimodal large language models (MLLMs) for open-web video retrieval and moment localization. Experiments show that current MLLMs still perform poorly at retrieval and moment localization from fuzzy memories; the authors expect the benchmark and framework to support more robust video retrieval in unstructured real-world settings.

📝 Abstract
Traditional video retrieval benchmarks focus on matching precise descriptions to closed video pools, failing to reflect real-world searches characterized by fuzzy, multi-dimensional memories on the open web. We present RVMS-Bench, a comprehensive system for evaluating real-world video memory search. It consists of 1,440 samples spanning 20 diverse categories and four duration groups, sourced from real-world open-web videos. RVMS-Bench utilizes a hierarchical description framework encompassing Global Impression, Key Moment, Temporal Context, and Auditory Memory to mimic realistic multi-dimensional search cues, with all samples strictly verified via a human-in-the-loop protocol. We further propose RACLO, an agentic framework that employs abductive reasoning to simulate the human "Recall-Search-Verify" cognitive process, effectively addressing the challenge of searching for videos via fuzzy memories in the real world. Experiments reveal that existing MLLMs still demonstrate insufficient capabilities in real-world Video Retrieval and Moment Localization based on fuzzy memories. We believe this work will facilitate the advancement of video retrieval robustness in real-world unstructured scenarios.
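To make the four-part description framework concrete, here is a minimal Python sketch of how one fuzzy-memory query might be represented. Only the four cue names come from the abstract. The class name `MemoryDescription`, the helper `build_query`, the field layout, and the sample text are hypothetical illustrations, not the benchmark's actual schema or data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemoryDescription:
    """Hypothetical container for the four cue types named in the abstract."""
    global_impression: Optional[str] = None  # overall gist of the video
    key_moment: Optional[str] = None         # the specific moment being sought
    temporal_context: Optional[str] = None   # what happened before or after it
    auditory_memory: Optional[str] = None    # remembered speech, music, or sounds

    def available_cues(self) -> dict[str, str]:
        """Return only the cues the searcher actually remembers."""
        return {name: text for name, text in vars(self).items() if text}


def build_query(desc: MemoryDescription) -> str:
    """Join the remembered cues into one labelled text query (illustrative only)."""
    return " | ".join(
        f"{name}: {text}" for name, text in desc.available_cues().items()
    )


# Invented example: the searcher recalls only two of the four cue types.
desc = MemoryDescription(
    global_impression="a cooking video in a small kitchen",
    auditory_memory="upbeat jazz in the background",
)
print(build_query(desc))
```

Making every field optional reflects the premise that a real searcher may recall only some dimensions of a video, so a query can be assembled from whichever cues are present.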
Problem

Research questions and friction points this paper is trying to address.

video retrieval
moment localization
fuzzy memory
open-web video
real-world search
Innovation

Methods, ideas, or system contributions that make the work stand out.

real-world video retrieval
moment localization
fuzzy memory
agent framework
abductive reasoning