When One Moment Isn't Enough: Multi-Moment Retrieval with Cross-Moment Interactions

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing moment retrieval (MR) methods primarily focus on single-moment localization, failing to address the realistic “one-query-multiple-moments” requirement. This paper proposes FlashMMR, a novel multi-moment retrieval framework, and introduces QV-M², the first high-quality, human-annotated multi-moment benchmark dataset. Methodologically, FlashMMR incorporates three key components: (1) cross-moment interaction modeling to capture inter-segment dependencies; (2) a multi-moment post-verification module for robust candidate filtering; and (3) constrained temporal adjustment for precise boundary refinement of candidate segments. Refined candidates are re-scored by the verification module and low-confidence proposals are pruned, yielding robust multi-moment alignment. On QV-M², FlashMMR achieves substantial improvements over state-of-the-art methods: +3.00% in global mean Average Precision (G-mAP), +2.70% in mAP@3+tgt, and +2.56% in mean Recall@3 (mR@3). This work establishes a strong baseline and a unified evaluation standard for multi-moment video temporal grounding.
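The reported multi-moment metrics (mR@3 in particular) hinge on matching a set of predicted moments against a set of ground-truth moments via temporal IoU. Below is a minimal Python sketch of that matching, assuming greedy one-to-one assignment and a 0.5 tIoU threshold; the paper's exact metric definitions live in its evaluation code, so the function names and thresholds here are illustrative only.

```python
def temporal_iou(a, b):
    """tIoU between two (start, end) moments, in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(preds, gts, k=3, iou_thr=0.5):
    """Fraction of ground-truth moments recovered by the top-k predictions.

    preds: list of (start, end, score); gts: list of (start, end).
    Greedy one-to-one matching: each ground-truth moment can be
    claimed by at most one prediction.
    """
    top_k = sorted(preds, key=lambda p: p[2], reverse=True)[:k]
    matched = set()
    for start, end, _ in top_k:
        # Find the best unmatched ground-truth moment above the threshold.
        best, best_iou = None, iou_thr
        for i, gt in enumerate(gts):
            if i in matched:
                continue
            iou = temporal_iou((start, end), gt)
            if iou >= best_iou:
                best, best_iou = i, iou
        if best is not None:
            matched.add(best)
    return len(matched) / len(gts) if gts else 0.0
```

For example, three predictions that cover two of three annotated moments at tIoU ≥ 0.5 yield recall_at_k(...) = 2/3; averaging over queries gives a mean Recall@3 of the kind reported above.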

📝 Abstract
Existing moment retrieval (MR) methods focus on Single-Moment Retrieval (SMR). However, one query can correspond to multiple relevant moments in real-world applications. This makes the existing datasets and methods insufficient for video temporal grounding. By revisiting the gap between current MR tasks and real-world applications, we introduce a high-quality dataset called QVHighlights Multi-Moment Dataset (QV-M$^2$), along with new evaluation metrics tailored for multi-moment retrieval (MMR). QV-M$^2$ consists of 2,212 annotations covering 6,384 video segments. Building on existing efforts in MMR, we propose a framework called FlashMMR. Specifically, we propose a Multi-moment Post-verification module to refine the moment boundaries. We introduce constrained temporal adjustment and subsequently leverage a verification module to re-evaluate the candidate segments. Through this filtering pipeline, low-confidence proposals are pruned and robust multi-moment alignment is achieved. We retrain and evaluate 6 existing MR methods on QV-M$^2$ and QVHighlights under both SMR and MMR settings. Results show that QV-M$^2$ serves as an effective benchmark for training and evaluating MMR models, while FlashMMR provides a strong baseline. Specifically, on QV-M$^2$, it improves over the prior SOTA method by 3.00% on G-mAP, 2.70% on mAP@3+tgt, and 2.56% on mR@3. The proposed benchmark and method establish a foundation for advancing research in more realistic and challenging video temporal grounding scenarios. Code is released at https://github.com/Zhuo-Cao/QV-M2.
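The post-verification pipeline in the abstract reads as a three-step filter: adjust boundaries under constraints, re-score with a verification module, and prune weak candidates. Here is a hedged Python sketch of that flow, assuming boundary offsets come from a regression head and the verifier returns a relevance score in [0, 1]; post_verify, max_shift, keep_thr, and the equal-weight score fusion are illustrative assumptions, not the released FlashMMR API.

```python
from typing import Callable, List, Tuple

Moment = Tuple[float, float, float]  # (start, end, confidence)

def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))

def post_verify(proposals: List[Moment],
                deltas: List[Tuple[float, float]],
                verifier: Callable[[float, float], float],
                duration: float,
                max_shift: float = 2.0,
                keep_thr: float = 0.5) -> List[Moment]:
    """Refine, re-score, and prune candidate moments.

    proposals: coarse (start, end, score) candidates from the retrieval stage.
    deltas:    predicted (d_start, d_end) boundary offsets, one per proposal.
    verifier:  scores a refined segment against the query, in [0, 1].
    """
    refined = []
    for (start, end, score), (ds, de) in zip(proposals, deltas):
        # Constrained temporal adjustment: bound each offset by max_shift
        # and keep the refined segment inside the video extent.
        s = clamp(start + clamp(ds, -max_shift, max_shift), 0.0, duration)
        e = clamp(end + clamp(de, -max_shift, max_shift), 0.0, duration)
        if e <= s:
            continue  # drop segments that collapse after adjustment
        # Post-verification: re-evaluate the adjusted segment and fuse
        # the verifier's score with the original proposal confidence.
        conf = 0.5 * (score + verifier(s, e))
        if conf >= keep_thr:  # prune low-confidence proposals
            refined.append((s, e, conf))
    return sorted(refined, key=lambda m: m[2], reverse=True)
```

The ranked output then feeds the multi-moment metrics sketched earlier; the key design point is that pruning happens after boundary refinement, so each proposal is judged on its adjusted extent rather than its coarse one.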
Problem

Research questions and friction points this paper is trying to address.

Addressing multi-moment retrieval with cross-moment interactions
Introducing new datasets and metrics for video temporal grounding
Proposing a framework to refine moment boundaries and alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-moment Post-verification module refines moment boundaries
Constrained temporal adjustment optimizes video segment alignment
Verification module filters low-confidence proposals for robustness
Zhuo Cao
Forschungszentrum Jülich
Artificial Intelligence, Astrophysics
Heming Du
The University of Queensland
Computer Vision
Bingqing Zhang
The University of Queensland, Australia
Xin Yu
The University of Queensland, Australia
Xue Li
The University of Queensland, Australia
Sen Wang
The University of Queensland, Australia