🤖 AI Summary
Video-to-Video Moment Retrieval (Vid2VidMR) aims to localize unseen event instances in a target video given a query video, posing core challenges in cross-video frame-level semantic alignment and long-range temporal dependency modeling. To address these, we propose MATR, a novel framework featuring: (1) a dual-stage sequence alignment mechanism for fine-grained semantic matching between query and target videos; (2) a self-supervised pre-training strategy that provides a strong task-specific initialization; and (3) joint foreground/background classification and boundary regression heads for moment localization. Evaluated on ActivityNet-VRL, MATR achieves absolute gains of 13.1% in R@1 and 8.1% in mIoU over prior methods. On our newly introduced SportsMoments benchmark, designed for fine-grained sports event retrieval, it further improves R@1 by 14.7% and mIoU by 14.4% over strong baselines.
📝 Abstract
Video-to-video moment retrieval (Vid2VidMR) is the task of localizing unseen events or moments in a target video using a query video. The task poses several challenges, including frame-level semantic alignment and modeling the complex dependencies between query and target videos. To tackle this problem, we introduce MATR (Moment Alignment TRansformer), a transformer-based model designed to capture both the semantic context and the temporal details necessary for precise moment localization. MATR conditions target video representations on query video features through dual-stage sequence alignment, which encodes the required correlations and dependencies. These representations then guide foreground/background classification and boundary prediction heads, enabling the model to accurately identify moments in the target video that semantically match the query video. Additionally, to provide a strong task-specific initialization for MATR, we propose a self-supervised pre-training technique in which the model is trained to localize random clips within videos. Extensive experiments demonstrate that MATR outperforms state-of-the-art methods by 13.1% in R@1 and 8.1% in mIoU on an absolute scale on the popular ActivityNet-VRL dataset. Additionally, on our newly proposed dataset, SportsMoments, MATR shows absolute gains of 14.7% in R@1 and 14.4% in mIoU over strong baselines.
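The self-supervised pre-training idea, training the model to localize random clips within videos, can be illustrated with a minimal sample-generation sketch. This is a hedged illustration, not the paper's actual implementation: the function name `make_pretraining_sample`, the clip-length bounds, and the label format (a per-frame foreground mask plus start/end regression targets, mirroring the classification and boundary prediction heads described above) are all assumptions for illustration.

```python
import random

def make_pretraining_sample(num_frames, min_len=4, max_len=16, rng=random):
    """Sample a random clip from a video of `num_frames` frames and build
    self-supervised localization targets for it (illustrative sketch only).

    Returns (query_span, fg_labels, boundaries):
      - query_span: (start, end) frame indices of the sampled clip, which
        would serve as the "query video" during pre-training;
      - fg_labels: per-frame 0/1 foreground mask over the target video,
        a plausible target for a foreground/background classification head;
      - boundaries: (start, end) regression targets for a boundary head.
    """
    # Clip length and start position are drawn uniformly at random.
    length = rng.randint(min_len, min(max_len, num_frames))
    start = rng.randint(0, num_frames - length)
    end = start + length  # exclusive end index

    # Frames inside the sampled clip are foreground (1), the rest background (0).
    fg_labels = [1 if start <= t < end else 0 for t in range(num_frames)]
    return (start, end), fg_labels, (start, end)
```

Because the query clip is cut from the target video itself, ground-truth boundaries come for free, giving the model a localization signal without any annotation.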