Aligning Moments in Time using Video Queries

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video-to-Video Moment Retrieval (Vid2VidMR) aims to localize unseen event instances in a target video given a query video, posing core challenges in cross-video frame-level semantic alignment and long-range temporal dependency modeling. To address these, we propose MATR—a novel framework featuring: (1) a two-stage sequence alignment mechanism for fine-grained semantic matching between query and target videos; (2) a self-supervised pretraining strategy to optimize feature initialization; and (3) a joint head integrating foreground/background classification with boundary regression for moment localization. Evaluated on ActivityNet-VRL, MATR achieves +13.1% R@1 and +8.1% mIoU over prior methods. On our newly introduced SportsMoments benchmark—designed for fine-grained sports event retrieval—it further improves R@1 by 14.7% and mIoU by 14.4%, demonstrating substantial gains over state-of-the-art approaches.

📝 Abstract
Video-to-video moment retrieval (Vid2VidMR) is the task of localizing unseen events or moments in a target video using a query video. This task poses several challenges, such as the need for semantic frame-level alignment and modeling complex dependencies between query and target videos. To tackle this challenging problem, we introduce MATR (Moment Alignment TRansformer), a transformer-based model designed to capture semantic context as well as the temporal details necessary for precise moment localization. MATR conditions target video representations on query video features using dual-stage sequence alignment that encodes the required correlations and dependencies. These representations are then used to guide foreground/background classification and boundary prediction heads, enabling the model to accurately identify moments in the target video that semantically match the query video. Additionally, to provide a strong task-specific initialization for MATR, we propose a self-supervised pre-training technique that involves training the model to localize random clips within videos. Extensive experiments demonstrate that MATR achieves notable performance improvements of 13.1% in R@1 and 8.1% in mIoU on an absolute scale compared to state-of-the-art methods on the popular ActivityNet-VRL dataset. Additionally, on our newly proposed dataset, SportsMoments, MATR shows a 14.7% gain in R@1 and a 14.4% gain in mIoU on an absolute scale over strong baselines.
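The abstract describes two prediction heads operating on query-conditioned target features: per-frame foreground/background classification and boundary prediction. A minimal sketch of that head design, where all shapes, names, and the plain linear-head choice are illustrative assumptions rather than the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def moment_heads(feats, w_cls, w_reg):
    """Hypothetical joint heads: per-frame foreground/background scores
    and (start, end) boundary offsets computed from query-conditioned
    target features. Linear heads and dimensions are assumptions."""
    cls_logits = feats @ w_cls   # (frames, 2): foreground vs. background
    boundaries = feats @ w_reg   # (frames, 2): start / end offsets
    return cls_logits, boundaries

dim, frames = 256, 64
feats = rng.standard_normal((frames, dim))  # query-conditioned target features
w_cls = rng.standard_normal((dim, 2))
w_reg = rng.standard_normal((dim, 2))
cls_logits, boundaries = moment_heads(feats, w_cls, w_reg)
print(cls_logits.shape, boundaries.shape)   # (64, 2) (64, 2)
```

In this reading, the classification head decides which frames belong to the matching moment, while the regression head refines where the moment starts and ends.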
Problem

Research questions and friction points this paper is trying to address.

Localizing unseen events in target videos using query videos
Addressing semantic frame-level alignment challenges in video retrieval
Modeling complex dependencies between query and target videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based model for video moment retrieval
Dual-stage sequence alignment for correlation encoding
Self-supervised pre-training with clip localization
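The self-supervised pre-training idea above (training the model to localize random clips within videos) can be sketched as a sample generator: crop a random clip from an unlabeled video, use it as the pseudo-query, and take its own location as the ground-truth moment. All parameter names and clip-length ranges here are illustrative assumptions, not the paper's exact recipe:

```python
import random

def make_pretraining_sample(num_frames, min_len=8, max_len=32, rng=random):
    """Hypothetical pre-training signal: a random clip is cropped from a
    video and treated as the query; its own position is the target moment."""
    length = rng.randint(min_len, min(max_len, num_frames))
    start = rng.randint(0, num_frames - length)
    end = start + length                    # exclusive end index
    query_frames = list(range(start, end))  # frames cropped as the query
    target_moment = (start, end)            # moment the model must localize
    return query_frames, target_moment

rng = random.Random(0)
query, (s, e) = make_pretraining_sample(128, rng=rng)
assert query[0] == s and query[-1] == e - 1
```

This requires no annotations, which is what makes it usable as a task-specific initialization before fine-tuning on labeled moment-retrieval data.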
Yogesh Kumar
Indian Institute of Technology Jodhpur
Uday Agarwal
Indian Institute of Technology Jodhpur
Manish Gupta
Microsoft
Anand Mishra
Indian Institute of Technology Jodhpur
Computer Vision · Machine Learning