Retrieving Any Relevant Moments: Benchmark and Models for Generalized Moment Retrieval

📅 2026-05-04

🤖 AI Summary

Conventional video moment retrieval assumes that each query matches exactly one moment, so it struggles with real-world queries that match several moments or none at all. This work proposes Generalized Moment Retrieval (GMR), a setting that unifies multi-moment, single-moment, and empty-set queries. It introduces Soccer-GMR, a large-scale soccer video benchmark with realistic positive and negative queries and flexible temporal annotations. The authors also design an evaluation protocol that handles empty-set predictions end to end, a lightweight GMR adapter for discriminative models, and a GRPO-based reward for fine-tuning multimodal large language models. Experiments show gains over baselines across the reported metrics and reveal that existing approaches fall short on empty-set rejection and multi-moment localization, moving video-language understanding toward more realistic applications.
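To make the three query types concrete, the following is a minimal Python sketch of how a GMR sample could be represented. The `GMRSample` class, the `(start, end)` tuple in seconds, and the `kind` labels are illustrative assumptions, not the data format of Soccer-GMR. The `temporal_iou` helper is the standard intersection-over-union for 1D intervals.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical representation: a moment is a (start_sec, end_sec) pair.
Moment = Tuple[float, float]


@dataclass
class GMRSample:
    """One query with its ground-truth moment set (assumed layout)."""
    query: str
    moments: List[Moment]  # an empty list means no relevant moment exists

    @property
    def kind(self) -> str:
        # Classify the query by the size of its ground-truth set.
        n = len(self.moments)
        if n == 0:
            return "empty"
        return "single" if n == 1 else "multi"


def temporal_iou(a: Moment, b: Moment) -> float:
    """Intersection over union of two temporal intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0
```

Under this representation, a conventional single-moment retriever corresponds to the special case where `moments` always has length one; GMR additionally allows lengths zero and greater than one.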

📝 Abstract

Video Moment Retrieval (VMR) aims to localize temporal segments in videos that correspond to a natural language query, but typically assumes only a single matching moment for each query. This assumption does not always hold in real-world scenarios, where queries may correspond to multiple or no moments. Thus, we formulate Generalized Moment Retrieval (GMR), a unified setting that requires retrieving the complete set of relevant moments or predicting an empty set. To enable systematic study of GMR, we introduce Soccer-GMR, a large-scale benchmark built on challenging soccer videos that reflect general GMR scenarios, with realistic negative and positive queries. The benchmark is constructed via a duration-flexible semi-automated pipeline with human verification, enabling scalable data generation while maintaining high annotation quality. We further design a unified evaluation protocol with complementary metrics tailored for null-set rejection, positive-query localization, and end-to-end GMR performance. Finally, we establish strong baselines across two modeling paradigms: a lightweight plug-and-play GMR adapter for discriminative VMR models, and a GMR-tailored GRPO reward for fine-tuning multimodal large language models (MLLMs). Extensive experiments show consistent gains across all metrics and expose key limitations of current methods, positioning GMR as a more realistic and challenging benchmark for video-language understanding.
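The abstract names three metric families (null-set rejection, positive-query localization, and end-to-end performance) without defining them here. The Python sketch below shows one plausible way such a protocol could be organized. Everything in it is an assumption for illustration: the metric names, the IoU threshold of 0.5, the greedy one-to-one matching, and the choice to score a correct rejection as 1.0 are not taken from the paper.

```python
from typing import Dict, List, Tuple

Moment = Tuple[float, float]  # assumed (start_sec, end_sec)


def temporal_iou(a: Moment, b: Moment) -> float:
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def match_count(pred: List[Moment], gt: List[Moment], thr: float) -> int:
    """Greedy one-to-one matching: each ground-truth moment is used at most once."""
    unused = list(gt)
    hits = 0
    for p in pred:
        best = max(unused, key=lambda g: temporal_iou(p, g), default=None)
        if best is not None and temporal_iou(p, best) >= thr:
            unused.remove(best)
            hits += 1
    return hits


def sample_f1(pred: List[Moment], gt: List[Moment], thr: float = 0.5) -> float:
    """Set-level F1 for one query; empty-vs-empty counts as a perfect score."""
    if not pred and not gt:
        return 1.0  # correct rejection (assumed convention)
    if not pred or not gt:
        return 0.0  # missed all moments, or hallucinated on a negative query
    hits = match_count(pred, gt, thr)
    if hits == 0:
        return 0.0
    precision, recall = hits / len(pred), hits / len(gt)
    return 2 * precision * recall / (precision + recall)


def evaluate(preds: List[List[Moment]], gts: List[List[Moment]],
             thr: float = 0.5) -> Dict[str, float]:
    def mean(xs: List[float]) -> float:
        return sum(xs) / len(xs) if xs else float("nan")

    pairs = list(zip(preds, gts))
    negatives = [(p, g) for p, g in pairs if not g]
    positives = [(p, g) for p, g in pairs if g]
    return {
        # Fraction of negative queries for which the model predicts nothing.
        "rejection_acc": mean([1.0 if not p else 0.0 for p, _ in negatives]),
        # Localization quality restricted to queries that have moments.
        "positive_f1": mean([sample_f1(p, g, thr) for p, g in positives]),
        # Both behaviors scored together over all queries.
        "end_to_end_f1": mean([sample_f1(p, g, thr) for p, g in pairs]),
    }
```

Separating the negative-only and positive-only scores from the end-to-end score makes it visible when a model does well on one behavior at the expense of the other, for example by never predicting an empty set.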

Problem

Research questions and friction points this paper is trying to address.

Generalized Moment Retrieval

Video Moment Retrieval

Natural Language Query

Null-set Prediction

Temporal Localization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generalized Moment Retrieval

Soccer-GMR Benchmark

Null-set Rejection