🤖 AI Summary
This work addresses core challenges in fine-grained moment retrieval from long videos (averaging over 500 seconds): the difficulty of precise localization, the lack of standardized evaluation protocols, and the absence of dedicated benchmarks. To this end, we introduce MomentSeeker, the first comprehensive benchmark for long-video moment retrieval (LVMR), covering four task categories: Moment Search, Caption Alignment, Image-conditioned Moment Search, and Video-conditioned Moment Search. It spans diverse domains including sports, movies, cartoons, and egocentric videos, with evaluation tasks curated through careful human annotation to ensure reliability. We further fine-tune an MLLM-based LVMR retriever on synthetic data, which serves as a strong baseline on the benchmark. Extensive experiments with popular multimodal retrievers highlight the challenges of LVMR and the limitations of existing methods, particularly in long-horizon temporal modeling, cross-modal semantic alignment, and instruction following. All data, code, and models will be shared with the community.
📝 Abstract
Retrieval-augmented generation (RAG) holds great promise for addressing the challenges of long-video understanding. RAG methods retrieve the moments of a long video that are useful for the task at hand, enabling multimodal large language models (MLLMs) to generate high-quality answers in a cost-effective way. In this work, we present MomentSeeker, a comprehensive benchmark for evaluating retrieval models on general long-video moment retrieval (LVMR) tasks. MomentSeeker offers three key advantages. First, it incorporates long videos of over 500 seconds on average, making it the first benchmark specialized for long-video moment retrieval. Second, it covers a wide range of task categories (Moment Search, Caption Alignment, Image-conditioned Moment Search, and Video-conditioned Moment Search) and diverse application scenarios (e.g., sports, movies, cartoons, and egocentric videos), making it a comprehensive tool for assessing retrieval models' general LVMR capability. Third, its evaluation tasks are carefully curated through human annotation, ensuring reliable assessment. We further fine-tune an MLLM-based LVMR retriever on synthetic data, which demonstrates strong performance on our benchmark. Finally, we conduct extensive experiments with various popular multimodal retrievers on our benchmark; the results highlight the challenges of LVMR and the limitations of existing methods. Our created resources will be shared with the community to advance future research in this field.