🤖 AI Summary
This work addresses fine-grained temporal localization in untrimmed first-person videos, targeting the three tasks of the Ego4D Episodic Memory Challenge: natural language query grounding, goal step localization, and moment retrieval. Departing from conventional late-fusion pipelines, we propose an end-to-end unified model built on early fusion. The approach jointly learns text-video temporal representations through early multimodal alignment, temporal convolutional modeling, and cross-modal attention, and uses contrastive learning to further strengthen cross-modal temporal alignment. A single architecture with shared parameters is jointly optimized across all three tasks. In the Ego4D 2025 Challenge, the model achieved first place in all three tracks, the first single-architecture solution to do so, and significantly outperforms existing baselines.
📝 Abstract
In this report, we present our champion solutions for the three egocentric video localization tracks of the Ego4D Episodic Memory Challenge at CVPR 2025. Each track requires precisely localizing a temporal interval within an untrimmed egocentric video. Previous unified video localization approaches often rely on late-fusion strategies, which tend to yield suboptimal results. To address this, we adopt an early-fusion video localization model for all three tasks, aiming to improve localization accuracy. Our method ultimately achieved first place in the Natural Language Queries, Goal Step, and Moment Queries tracks, demonstrating its effectiveness. Our code is available at https://github.com/Yisen-Feng/OSGNet.
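The early-fusion design described above can be illustrated with a minimal sketch: text and video tokens are concatenated before a shared transformer encoder, so cross-modal attention occurs in every layer, and a temporal convolution head then scores interval boundaries over the video tokens. All module names, sizes, and heads here are illustrative assumptions for exposition, not the released OSGNet implementation.

```python
import torch
import torch.nn as nn

class EarlyFusionLocalizer(nn.Module):
    """Toy early-fusion grounding model: unlike late fusion (which encodes
    each modality separately and merges only at the prediction head), text
    and video tokens are jointly encoded from the first layer onward."""

    def __init__(self, dim=128, n_heads=4, n_layers=2):
        super().__init__()
        # learned embeddings marking which modality each token came from
        self.modality_embed = nn.Embedding(2, dim)  # 0 = video, 1 = text
        layer = nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # 1-D temporal convolution over the fused video tokens, then
        # per-frame logits for interval start / end boundaries
        self.temporal_conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.boundary_head = nn.Linear(dim, 2)  # (start_logit, end_logit)

    def forward(self, video_feats, text_feats):
        # video_feats: (B, T, D); text_feats: (B, L, D)
        B, T, _ = video_feats.shape
        v = video_feats + self.modality_embed.weight[0]
        t = text_feats + self.modality_embed.weight[1]
        fused = self.encoder(torch.cat([v, t], dim=1))  # early fusion
        vid = fused[:, :T]                              # keep video tokens
        vid = self.temporal_conv(vid.transpose(1, 2)).transpose(1, 2)
        return self.boundary_head(vid)                  # (B, T, 2)

model = EarlyFusionLocalizer()
logits = model(torch.randn(2, 30, 128), torch.randn(2, 8, 128))
print(logits.shape)  # torch.Size([2, 30, 2])
```

Because the shared encoder sees both modalities at once, the query can influence how every video frame is represented, which is the property the report credits for improved localization accuracy over late-fusion baselines.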