🤖 AI Summary
This work addresses temporal localization in long, untrimmed first-person videos given natural language queries or goal-step descriptions. The authors propose a two-stage reranking framework: OSGNet first generates a set of candidate segments, and a multimodal large language model (MLLM) then selects the candidate that best matches the query. The approach applies the video-language reasoning capability of MLLMs specifically to the reranking stage, refining the final prediction while preserving the efficiency and candidate recall of the conventional localization pipeline. The method secured first place in both the Natural Language Queries and GoalStep tracks of the Ego4D Episodic Memory Challenge at CVPR 2026.
📝 Abstract
In this report, we present our champion solutions for the Natural Language Queries and GoalStep tracks of the Ego4D Episodic Memory Challenge at CVPR 2026. Both tracks require accurately localizing temporal segments in long, untrimmed egocentric videos. To address these tasks, we propose a reranking-based framework that leverages the strong video-language reasoning capability of multimodal large language models (MLLMs) while preserving the efficiency and candidate recall of conventional localization pipelines. Specifically, we first obtain a set of candidate segments from the existing localization model OSGNet, and then employ an MLLM to select the segment that best matches the given query, thereby refining the final prediction. Our method achieved first place in both the Natural Language Queries and GoalStep tracks. Our code can be found at https://github.com/iLearn-Lab/CVPR25-OSGNet.
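The two-stage idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Candidate` type, the `rerank` function, the `top_k` cutoff, and the `match_score` callback are all hypothetical names introduced here, and the abstract does not say how many candidates are passed to the MLLM or how it is prompted. The scorer below is a toy stand-in for an MLLM call.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical container for one first-stage proposal."""
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    score: float  # confidence from the first-stage localizer


def rerank(
    query: str,
    candidates: Sequence[Candidate],
    match_score: Callable[[str, Candidate], float],
    top_k: int = 5,  # assumed shortlist size, not taken from the report
) -> Candidate:
    """Shortlist the top_k first-stage candidates, then return the one
    that the second-stage scorer rates as the best match for the query."""
    if not candidates:
        raise ValueError("need at least one candidate segment")
    shortlist = sorted(candidates, key=lambda c: c.score, reverse=True)[:top_k]
    return max(shortlist, key=lambda c: match_score(query, c))


def toy_scorer(query: str, cand: Candidate) -> float:
    # Stand-in for an MLLM judgment: pretends the relevant moment is at
    # t = 42 s and scores segments by how close their midpoint is to it.
    midpoint = (cand.start + cand.end) / 2
    return -abs(midpoint - 42.0)


candidates = [
    Candidate(10.0, 15.0, 0.90),    # top first-stage score, wrong moment
    Candidate(40.0, 45.0, 0.70),    # lower first-stage score, right moment
    Candidate(100.0, 110.0, 0.20),  # dropped by the top_k cutoff below
]
best = rerank("Where did I put the keys?", candidates, toy_scorer, top_k=2)
print(best)
```

In this toy run the second stage overrides the first-stage ranking: the segment with the highest localizer confidence is passed over in favor of the one the scorer prefers. The final answer can only come from the shortlist, which is why the candidate recall of the first stage matters.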