🤖 AI Summary
This work addresses fine-grained temporal localization in untrimmed first-person videos, targeting the three tasks of the Ego4D Episodic Memory Challenge: natural language query grounding, goal step localization, and moment retrieval. Departing from conventional late-fusion pipelines, we propose an end-to-end unified model built on early fusion. The approach jointly learns text-video temporal representations through early multimodal alignment, temporal convolutional modeling, and cross-modal attention, and uses contrastive learning to further strengthen cross-modal temporal alignment. A single architecture with shared parameters is jointly optimized across all three tasks. In the Ego4D 2025 Challenge, the model achieved first place in all three tracks, the first single-architecture solution to do so, and significantly outperforms existing baselines.
📝 Abstract
In this report, we present our champion solutions for the three egocentric video localization tracks of the Ego4D Episodic Memory Challenge at CVPR 2025. Each track requires precisely localizing a temporal interval within an untrimmed egocentric video. Previous unified video localization approaches often rely on late-fusion strategies, which tend to yield suboptimal results. To address this, we adopt an early-fusion video localization model for all three tasks, aiming to improve localization accuracy. Our method ultimately achieved first place in the Natural Language Queries, Goal Step, and Moment Queries tracks, demonstrating its effectiveness. Our code is available at https://github.com/Yisen-Feng/OSGNet.
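The early-fusion design described above can be illustrated with a minimal sketch: text and video tokens are concatenated before a shared transformer encoder, so cross-modal attention occurs in every layer, and a temporal convolution head then scores interval boundaries over the video tokens. All module names, sizes, and heads here are illustrative assumptions for exposition, not the released OSGNet implementation.

```python
import torch
import torch.nn as nn

class EarlyFusionLocalizer(nn.Module):
    """Toy early-fusion grounding model: unlike late fusion (which encodes
    each modality separately and merges only at the prediction head), text
    and video tokens are jointly encoded from the first layer onward."""

    def __init__(self, dim=128, n_heads=4, n_layers=2):
        super().__init__()
        # learned embeddings marking which modality each token came from
        self.modality_embed = nn.Embedding(2, dim)  # 0 = video, 1 = text
        layer = nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # 1-D temporal convolution over the fused video tokens, then
        # per-frame logits for interval start / end boundaries
        self.temporal_conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.boundary_head = nn.Linear(dim, 2)  # (start_logit, end_logit)

    def forward(self, video_feats, text_feats):
        # video_feats: (B, T, D); text_feats: (B, L, D)
        B, T, _ = video_feats.shape
        v = video_feats + self.modality_embed.weight[0]
        t = text_feats + self.modality_embed.weight[1]
        fused = self.encoder(torch.cat([v, t], dim=1))  # early fusion
        vid = fused[:, :T]                              # keep video tokens
        vid = self.temporal_conv(vid.transpose(1, 2)).transpose(1, 2)
        return self.boundary_head(vid)                  # (B, T, 2)

model = EarlyFusionLocalizer()
logits = model(torch.randn(2, 30, 128), torch.randn(2, 8, 128))
print(logits.shape)  # torch.Size([2, 30, 2])
```

Because the shared encoder sees both modalities at once, the query can influence how every video frame is represented, which is the property the report credits for improved localization accuracy over late-fusion baselines.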