EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs

📅 2025-07-24
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This paper addresses critical weaknesses of multimodal large language models (MLLMs) in cross-view understanding between egocentric and exocentric video—specifically, insufficient semantic alignment, viewpoint association, and temporal reasoning. To this end, the authors introduce EgoExoBench, the first dedicated benchmark for this task. Built from publicly available datasets, it comprises over 7,300 question-answer pairs across 11 subtasks that systematically define and quantify the three core challenges. The multi-task evaluation framework assesses 13 state-of-the-art MLLMs. Results reveal that while current models perform reasonably well on single-view understanding, they exhibit significant bottlenecks in cross-view semantic alignment and temporal reasoning. EgoExoBench fills a key gap in video understanding benchmarks and uncovers fundamental capability deficits, thereby providing concrete guidance for advancing model architectures, cross-view alignment mechanisms, and temporal modeling strategies.

📝 Abstract
Transferring and integrating knowledge across first-person (egocentric) and third-person (exocentric) viewpoints is intrinsic to human intelligence, enabling humans to learn from others and convey insights from their own experiences. Despite rapid progress in multimodal large language models (MLLMs), their ability to perform such cross-view reasoning remains unexplored. To address this, we introduce EgoExoBench, the first benchmark for egocentric-exocentric video understanding and reasoning. Built from publicly available datasets, EgoExoBench comprises over 7,300 question-answer pairs spanning eleven sub-tasks organized into three core challenges: semantic alignment, viewpoint association, and temporal reasoning. We evaluate 13 state-of-the-art MLLMs and find that while these models excel on single-view tasks, they struggle to align semantics across perspectives, accurately associate views, and infer temporal dynamics in the ego-exo context. We hope EgoExoBench can serve as a valuable resource for research on embodied agents and intelligent assistants seeking human-like cross-view intelligence.
Problem

Research questions and friction points this paper is trying to address.

Assessing MLLMs' cross-view reasoning in video understanding
Evaluating semantic alignment across first- and third-person views
Testing temporal reasoning in ego-exo video contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces EgoExoBench for cross-view video understanding
Evaluates 13 MLLMs on semantic and temporal reasoning
Organizes tasks into three core cross-view challenges
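The benchmark's structure—multiple-choice question-answer pairs grouped into eleven sub-tasks under three core challenges—lends itself to a simple per-challenge scoring harness. The sketch below is illustrative only: the `QAPair` schema, field names, and `accuracy_by_challenge` helper are hypothetical, not the paper's actual data format or evaluation code.

```python
from dataclasses import dataclass, field

# Hypothetical schema for one EgoExoBench-style multiple-choice item.
# Field names are assumptions for illustration, not the released format.
@dataclass
class QAPair:
    challenge: str        # e.g. "semantic_alignment", "viewpoint_association",
                          # or "temporal_reasoning"
    subtask: str          # one of the eleven sub-tasks
    question: str
    options: list[str] = field(default_factory=list)
    answer: int = 0       # index of the correct option


def accuracy_by_challenge(items: list[QAPair], predictions: list[int]) -> dict[str, float]:
    """Aggregate per-item correctness into accuracy per core challenge."""
    totals: dict[str, int] = {}
    correct: dict[str, int] = {}
    for item, pred in zip(items, predictions):
        totals[item.challenge] = totals.get(item.challenge, 0) + 1
        correct[item.challenge] = correct.get(item.challenge, 0) + int(pred == item.answer)
    return {c: correct[c] / totals[c] for c in totals}


items = [
    QAPair("semantic_alignment", "action_matching", "Which exo clip matches?",
           ["A", "B", "C", "D"], answer=2),
    QAPair("temporal_reasoning", "event_ordering", "What happens next?",
           ["A", "B", "C", "D"], answer=0),
]
print(accuracy_by_challenge(items, [2, 1]))
# {'semantic_alignment': 1.0, 'temporal_reasoning': 0.0}
```

Reporting accuracy per challenge rather than a single aggregate is what lets the paper surface the gap between strong single-view performance and weak cross-view alignment.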
Authors
Yuping He (Nanjing University, Shanghai AI Laboratory)
Yifei Huang (Shanghai AI Laboratory, The University of Tokyo)
Guo Chen (Nanjing University)
Baoqi Pei (Zhejiang University)
Jilan Xu (Fudan University)
Tong Lu (Nanjing University)
Jiangmiao Pang (Shanghai AI Laboratory)