Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding

📅 2025-03-12
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address data scarcity and limited model capability in egocentric (first-person) video understanding, this paper proposes leveraging large-scale exocentric (third-person) visual knowledge for embodied vision-language modeling via cross-perspective knowledge transfer. Methodologically, we introduce the first end-to-end ego-exo collaborative pretraining paradigm and construct Ego-ExoClipβ€”a million-scale synchronized ego-exo video-text pair dataset. We design a three-stage progressive training framework integrating cross-perspective contrastive learning, multi-stage knowledge distillation, and synchronized multi-source alignment modeling. Additionally, we release EgoIT, a dedicated instruction-tuning dataset, and EgoBench, a comprehensive evaluation benchmark comprising eight diverse tasks. Extensive experiments on EgoBench demonstrate significant performance gains over state-of-the-art multimodal large language models (MLLMs), validating the efficacy of exocentric-to-egocentric knowledge transfer. This work establishes a novel paradigm for low-resource embodied vision-language modeling.

πŸ“ Abstract
AI personal assistants, deployed through robots or wearables, require embodied understanding to collaborate effectively with humans. Current Multimodal Large Language Models (MLLMs) primarily focus on third-person (exocentric) vision, overlooking the unique aspects of first-person (egocentric) videos. Additionally, high acquisition costs limit data size, impairing MLLM performance. To address these challenges, we propose learning the mapping between the exocentric and egocentric domains, leveraging the extensive exocentric knowledge within existing MLLMs to enhance egocentric video understanding. To this end, we introduce Ego-ExoClip, a pre-training dataset comprising 1.1M synchronized ego-exo clip-text pairs derived from Ego-Exo4D. Our approach features a progressive training pipeline with three stages: Teacher Self-Preparation, Teacher-Student Guidance, and Student Self-Practice. Additionally, we construct EgoIT, an instruction-tuning dataset drawn from multiple sources, to strengthen the model's instruction-following capabilities, along with the EgoBench benchmark comprising eight different tasks for thorough evaluation. Extensive experiments across diverse egocentric tasks reveal that existing MLLMs perform inadequately in egocentric video understanding, while our model significantly outperforms these leading models.
Problem

Research questions and friction points this paper is trying to address.

Enhance egocentric video understanding using exocentric knowledge.
Address data scarcity in egocentric video datasets.
Improve MLLM performance on first-person video tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages exocentric knowledge for egocentric video understanding
Introduces Ego-ExoClip dataset with 1.1M clip-text pairs
Proposes progressive training pipeline with three stages
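The cross-perspective contrastive learning named in the AI summary can be sketched as a symmetric InfoNCE loss over synchronized ego/exo clip embeddings. The function below is a minimal NumPy illustration under assumed details (batch construction, temperature, embedding dimension), not the authors' implementation:

```python
import numpy as np

def cross_view_info_nce(ego, exo, temperature=0.07):
    """Symmetric InfoNCE between batches of ego and exo clip embeddings.

    ego, exo: (B, D) arrays where row i of each holds a synchronized pair.
    Returns a scalar loss that is low when matched pairs are most similar.
    """
    # L2-normalize so dot products become cosine similarities
    ego = ego / np.linalg.norm(ego, axis=1, keepdims=True)
    exo = exo / np.linalg.norm(exo, axis=1, keepdims=True)
    logits = ego @ exo.T / temperature  # (B, B); matched pairs on the diagonal

    def xent_diagonal(l):
        # cross-entropy with the diagonal as the target class (numerically stable)
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()

    # average the ego->exo and exo->ego retrieval directions
    return 0.5 * (xent_diagonal(logits) + xent_diagonal(logits.T))
```

When the two views of each clip embed near each other and away from other clips, the loss approaches zero, which is the alignment the ego-exo pretraining stage aims to induce.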
Haoyu Zhang
Harbin Institute of Technology (Shenzhen), Pengcheng Laboratory
Qiaohui Chu
Harbin Institute of Technology (Shenzhen)
Multimodal Analysis · Egocentric Vision
Meng Liu
Shandong Jianzhu University
Yunxiao Wang
PhD student, Shandong University
Multimedia Computing · Affective Computing · Information Retrieval
Bin Wen
Kuaishou
MLLM
Fan Yang
Kuaishou
Tingting Gao
Kuaishou
Di Zhang
Kuaishou
Yaowei Wang
The Hong Kong Polytechnic University
Liqiang Nie
Harbin Institute of Technology (Shenzhen)