🤖 AI Summary
This work addresses the significant performance drop observed in existing video moment retrieval (VMR) methods when transferring from descriptive captions to real-world search queries, particularly in multi-moment scenarios where limited linguistic expressiveness and poor generalization pose dual challenges. The study systematically uncovers the language gap and multi-moment gap inherent in this transfer and introduces three new benchmarks, built by modifying the textual queries of the public HD-EPIC, YouCook2, and ActivityNet-Captions datasets, to facilitate research in this direction. To mitigate decoder-query collapse in DETR-based architectures, the authors propose architectural modifications that increase the number of active decoder queries, coupled with explicit multi-moment modeling. Experimental results demonstrate substantial improvements, with gains of up to 14.82% mAP_m on search queries and up to 21.83% mAP_m on multi-moment search queries.
📝 Abstract
In this work, we investigate the degradation of existing VMR methods, particularly DETR-based architectures, when trained on caption-based queries but evaluated on search queries. To this end, we introduce three benchmarks by modifying the textual queries in three public VMR datasets: HD-EPIC, YouCook2, and ActivityNet-Captions. Our analysis reveals two key generalization challenges: (i) a language gap, arising from the linguistic under-specification of search queries, and (ii) a multi-moment gap, caused by the shift from single-moment to multi-moment queries. We also identify a critical issue in these architectures, active decoder-query collapse, as a primary cause of the poor generalization to multi-moment instances. We mitigate this issue with architectural modifications that effectively increase the number of active decoder queries. Extensive experiments demonstrate that our approach improves performance by up to 14.82% mAP_m on search queries, and by up to 21.83% mAP_m on multi-moment search queries. The code, models and data are available on the project webpage: https://davidpujol.github.io/beyond-vmr/