🤖 AI Summary
This work addresses video–text multi-instance retrieval in egocentric videos, where existing approaches often neglect temporal dynamics and the graded relevance encoded in soft labels. The authors build on a CLIP dual-encoder backbone and add two components. First, a temporal Transformer on the video side models inter-frame dependencies over frame-level CLIP features. Second, a two-stage reranking pipeline retrieves the top-K candidates with the dual encoder and then refines their scores using a cross-encoder equipped with an Image–Text Matching (ITM) head. The system is trained with a symmetric multi-similarity loss that exploits the soft-label relevance matrices. The method reaches an average mAP of 67.97% and an average nDCG of 82.92% on the EPIC-KITCHENS-100 MIR benchmark.
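The retrieve-then-rerank flow described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `cross_score_fn` callback (a stand-in for a cross-encoder with an ITM head), and the choice to rank the shortlist purely by the second-stage score are all assumptions made for illustration. The summary does not say how the two scores are combined.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def two_stage_rerank(query_emb, video_embs, cross_score_fn, k=5):
    """Hypothetical retrieve-then-rerank helper.

    query_emb:      (d,) text embedding from the dual encoder.
    video_embs:     (n, d) video embeddings from the dual encoder.
    cross_score_fn: callable mapping a video index to a matching score;
                    stands in for a cross-encoder with an ITM head (assumption).
    """
    # Stage 1: cheap cosine similarity against every video.
    sims = l2_normalize(video_embs) @ l2_normalize(query_emb)
    k = min(k, len(sims))
    top_k = np.argsort(-sims, kind="stable")[:k]

    # Stage 2: rescore only the shortlist with the expensive model.
    # Assumption: the refined score replaces the stage-1 score outright.
    refined = np.array([cross_score_fn(int(i)) for i in top_k], dtype=float)
    order = np.argsort(-refined, kind="stable")
    return top_k[order], refined[order]
```

The point of the split is cost: the dual encoder scores all candidates with one matrix product, while the cross-encoder, which must process each query and video pair jointly, runs only on the K shortlisted items.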
📝 Abstract
Video-text retrieval has witnessed remarkable progress driven by large-scale vision-language pretraining, yet most existing approaches inherit an implicit assumption from image-text retrieval: that visual semantics can be captured frame-by-frame. This assumption overlooks the temporal dynamics of egocentric videos. The EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge further raises the bar by providing soft-label relevance matrices rather than binary labels, demanding models that can resolve graded semantic correspondences across modalities. In this report, we present our solution, termed TempRet, to the CVPR 2026 EPIC-KITCHENS-100 MIR challenge. Our approach builds upon a CLIP-based dual-encoder backbone and introduces two key components to address the temporal and cross-modal challenges. First, a temporal transformer operates exclusively on the video side, modeling inter-frame dependencies through learnable positional encodings and multi-head self-attention over frame-level CLIP features. Second, a two-stage reranking pipeline first retrieves Top-K candidates via the dual-encoder, then refines their scores using a cross-encoder equipped with an Image-Text Matching (ITM) head. The entire system is trained with Symmetric Multi-Similarity Loss to exploit the soft-label relevance matrices provided by the challenge. Our method achieves 67.97% average mAP and 82.92% average nDCG on the EK-100 MIR benchmark, demonstrating the effectiveness of temporal modeling and cross-modal refinement for egocentric video retrieval.
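To make the temporal component concrete, the following is a rough NumPy sketch of one multi-head self-attention pass over frame-level features with an added positional table. It is a simplification under stated assumptions, not TempRet itself: the residual connection, the mean pooling into a single video vector, the absence of layer normalization and feed-forward sublayers, and all parameter names are illustrative choices that the abstract does not specify. In a real model `pos_emb` and the projection matrices would be learned.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_self_attention(frames, pos_emb, w_q, w_k, w_v, w_o, num_heads):
    """One multi-head self-attention pass over frame features (illustrative).

    frames:  (t, d) per-frame features, e.g. from a CLIP image encoder.
    pos_emb: (max_t, d) positional table (learned in a real model).
    w_q, w_k, w_v, w_o: (d, d) projection matrices (learned in a real model).
    Returns a pooled (d,) video vector and the (heads, t, t) attention weights.
    """
    t, d = frames.shape
    if d % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    head_dim = d // num_heads

    # Inject frame order, which frame-by-frame encoding alone does not carry.
    x = frames + pos_emb[:t]

    def split_heads(m):
        # (t, d) -> (heads, t, head_dim)
        return m.reshape(t, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)

    # Scaled dot-product attention between every pair of frames, per head.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim), axis=-1)

    # Merge heads back to (t, d) and apply the output projection.
    ctx = (attn @ v).transpose(1, 0, 2).reshape(t, d)
    out = x + ctx @ w_o  # residual connection (assumption)

    # Mean pooling over time is an assumption; the abstract does not
    # state how frame outputs are aggregated.
    return out.mean(axis=0), attn
```

Because each frame attends to every other frame, the pooled vector can reflect ordering and interactions between frames, which a plain average of independent frame embeddings cannot.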