OSGNet @ Ego4D Episodic Memory Challenge 2025

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses fine-grained temporal localization in untrimmed first-person videos, targeting the three tasks defined in the Ego4D Episodic Memory Challenge: natural language query grounding, goal-step localization, and moment retrieval. Departing from the conventional late-fusion paradigm, the authors propose an end-to-end unified model built on early fusion. The approach jointly learns text-video temporal-semantic representations through early multimodal alignment, temporal convolutional modeling, and cross-modal attention, and further applies contrastive learning to strengthen cross-modal temporal alignment. A single set of shared parameters is jointly optimized across all three tasks. On the Ego4D 2025 Challenge, the model achieves first place in all three tracks, the first single-architecture solution to do so, significantly outperforming existing baselines.
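The core idea, fusing the text query into the video representation inside the encoder rather than combining per-modality scores at the end, can be illustrated with a minimal cross-modal attention sketch. This is a toy illustration of the general technique, not the authors' OSGNet implementation; all names and the toy feature values are assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query (here, a video frame
    feature) attends over the keys (here, text query tokens), so the
    text conditions the video representation before any localization
    head sees it -- the essence of early fusion."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)  # attention weights over text tokens
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

# Toy inputs: 3 video frame features and 2 text token features (dim 2).
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text  = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_attention(video, text, text)
```

In a late-fusion design, by contrast, the video and text encoders would never exchange information until their outputs are compared at the scoring stage, which is the limitation the early-fusion model is meant to overcome.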

📝 Abstract
In this report, we present our champion solutions for the three egocentric video localization tracks of the Ego4D Episodic Memory Challenge at CVPR 2025. All tracks require precise localization of the interval within an untrimmed egocentric video. Previous unified video localization approaches often rely on late fusion strategies, which tend to yield suboptimal results. To address this, we adopt an early fusion-based video localization model to tackle all three tasks, aiming to enhance localization accuracy. Ultimately, our method achieved first place in the Natural Language Queries, Goal Step, and Moment Queries tracks, demonstrating its effectiveness. Our code can be found at https://github.com/Yisen-Feng/OSGNet.
Problem

Research questions and friction points this paper is trying to address.

Precise interval localization in untrimmed egocentric videos
Overcoming suboptimal late fusion in video localization
Enhancing accuracy across multiple episodic memory tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Early fusion-based video localization model
Enhanced accuracy in interval localization
Unified approach for multiple tasks
Yisen Feng - Harbin Institute of Technology (Shenzhen) - Multimodal Analysis
Haoyu Zhang - Harbin Institute of Technology (Shenzhen), Pengcheng Laboratory
Qiaohui Chu - Harbin Institute of Technology (Shenzhen) - Multimodal Analysis, Egocentric Vision
Meng Liu - Shandong Jianzhu University
Weili Guan - Harbin Institute of Technology (Shenzhen)
Yaowei Wang - The Hong Kong Polytechnic University
Liqiang Nie - Harbin Institute of Technology (Shenzhen)