When One Moment Isn't Enough: Multi-Moment Retrieval with Cross-Moment Interactions

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing moment retrieval (MR) methods primarily focus on single-moment localization, failing to address the realistic “one-query-multiple-moments” requirement. This paper proposes FlashMMR, a novel multi-moment retrieval framework, and introduces QV-M², the first high-quality, human-annotated multi-moment benchmark dataset. Methodologically, FlashMMR incorporates three key components: (1) cross-moment interaction modeling to capture inter-segment dependencies; (2) a multi-moment post-verification module for robust candidate filtering; and (3) constrained temporal adjustment for precise boundary refinement of candidate segments. Refined candidates are re-scored by the verification module and low-confidence proposals are pruned, yielding robust multi-moment alignment. On QV-M², FlashMMR achieves substantial improvements over state-of-the-art methods: +3.00% in global mean Average Precision (G-mAP), +2.70% in mAP@3+tgt, and +2.56% in mean Recall@3 (mR@3). This work establishes a strong baseline and a unified evaluation standard for multi-moment video temporal grounding.
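The reported multi-moment metrics (mR@3 in particular) hinge on matching a set of predicted moments against a set of ground-truth moments via temporal IoU. Below is a minimal Python sketch of that matching, assuming greedy one-to-one assignment and a 0.5 tIoU threshold; the paper's exact metric definitions live in its evaluation code, so the function names and thresholds here are illustrative only.

```python
def temporal_iou(a, b):
    """tIoU between two (start, end) moments, in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(preds, gts, k=3, iou_thr=0.5):
    """Fraction of ground-truth moments recovered by the top-k predictions.

    preds: list of (start, end, score); gts: list of (start, end).
    Greedy one-to-one matching: each ground-truth moment can be
    claimed by at most one prediction.
    """
    top_k = sorted(preds, key=lambda p: p[2], reverse=True)[:k]
    matched = set()
    for start, end, _ in top_k:
        # Find the best unmatched ground-truth moment above the threshold.
        best, best_iou = None, iou_thr
        for i, gt in enumerate(gts):
            if i in matched:
                continue
            iou = temporal_iou((start, end), gt)
            if iou >= best_iou:
                best, best_iou = i, iou
        if best is not None:
            matched.add(best)
    return len(matched) / len(gts) if gts else 0.0
```

For example, three predictions that cover two of three annotated moments at tIoU ≥ 0.5 yield recall_at_k(...) = 2/3; averaging over queries gives a mean Recall@3 of the kind reported above.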

📝 Abstract
Existing moment retrieval (MR) methods focus on Single-Moment Retrieval (SMR). However, one query can correspond to multiple relevant moments in real-world applications. This makes the existing datasets and methods insufficient for video temporal grounding. By revisiting the gap between current MR tasks and real-world applications, we introduce a high-quality dataset called QVHighlights Multi-Moment Dataset (QV-M$^2$), along with new evaluation metrics tailored for multi-moment retrieval (MMR). QV-M$^2$ consists of 2,212 annotations covering 6,384 video segments. Building on existing efforts in MMR, we propose a framework called FlashMMR. Specifically, we propose a Multi-moment Post-verification module to refine the moment boundaries. We introduce constrained temporal adjustment and subsequently leverage a verification module to re-evaluate the candidate segments. Through this filtering pipeline, low-confidence proposals are pruned and robust multi-moment alignment is achieved. We retrain and evaluate 6 existing MR methods on QV-M$^2$ and QVHighlights under both SMR and MMR settings. Results show that QV-M$^2$ serves as an effective benchmark for training and evaluating MMR models, while FlashMMR provides a strong baseline. Specifically, on QV-M$^2$, it improves over the prior SOTA method by 3.00% on G-mAP, 2.70% on mAP@3+tgt, and 2.56% on mR@3. The proposed benchmark and method establish a foundation for advancing research in more realistic and challenging video temporal grounding scenarios. Code is released at https://github.com/Zhuo-Cao/QV-M2.
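The post-verification pipeline in the abstract reads as a three-step filter: adjust boundaries under constraints, re-score with a verification module, and prune weak candidates. Here is a hedged Python sketch of that flow, assuming boundary offsets come from a regression head and the verifier returns a relevance score in [0, 1]; post_verify, max_shift, keep_thr, and the equal-weight score fusion are illustrative assumptions, not the released FlashMMR API.

```python
from typing import Callable, List, Tuple

Moment = Tuple[float, float, float]  # (start, end, confidence)

def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))

def post_verify(proposals: List[Moment],
                deltas: List[Tuple[float, float]],
                verifier: Callable[[float, float], float],
                duration: float,
                max_shift: float = 2.0,
                keep_thr: float = 0.5) -> List[Moment]:
    """Refine, re-score, and prune candidate moments.

    proposals: coarse (start, end, score) candidates from the retrieval stage.
    deltas:    predicted (d_start, d_end) boundary offsets, one per proposal.
    verifier:  scores a refined segment against the query, in [0, 1].
    """
    refined = []
    for (start, end, score), (ds, de) in zip(proposals, deltas):
        # Constrained temporal adjustment: bound each offset by max_shift
        # and keep the refined segment inside the video extent.
        s = clamp(start + clamp(ds, -max_shift, max_shift), 0.0, duration)
        e = clamp(end + clamp(de, -max_shift, max_shift), 0.0, duration)
        if e <= s:
            continue  # drop segments that collapse after adjustment
        # Post-verification: re-evaluate the adjusted segment and fuse
        # the verifier's score with the original proposal confidence.
        conf = 0.5 * (score + verifier(s, e))
        if conf >= keep_thr:  # prune low-confidence proposals
            refined.append((s, e, conf))
    return sorted(refined, key=lambda m: m[2], reverse=True)
```

The ranked output then feeds the multi-moment metrics sketched earlier; the key design point is that pruning happens after boundary refinement, so each proposal is judged on its adjusted extent rather than its coarse one.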
Problem

Research questions and friction points this paper is trying to address.

Addressing multi-moment retrieval with cross-moment interactions
Introducing new datasets and metrics for video temporal grounding
Proposing a framework to refine moment boundaries and alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-moment Post-verification module refines moment boundaries
Constrained temporal adjustment optimizes video segment alignment
Verification module filters low-confidence proposals for robustness
Zhuo Cao
Forschungszentrum Jülich
Artificial Intelligence, Astrophysics
Heming Du
The University of Queensland
Computer Vision
Bingqing Zhang
The University of Queensland, Australia
Xin Yu
The University of Queensland, Australia
Xue Li
The University of Queensland, Australia
Sen Wang
The University of Queensland, Australia