EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) face two key bottlenecks in egocentric video understanding: inferring the latent intentions of the camera wearer and recognizing fine-grained hand-object interactions. To address these challenges, we propose EgoThinker, a novel framework built around EgoRe-5M, a large-scale egocentric video question-answering dataset of 5 million QA pairs. The method combines spatio-temporal chain-of-thought (CoT) supervision with a two-stage learning paradigm, supervised fine-tuning (SFT) followed by reinforcement fine-tuning (RFT), jointly leveraging dense hand-object grounding annotations and chained spatio-temporal rationales. This design substantially improves embodied reasoning and fine-grained spatio-temporal localization. Extensive experiments show that EgoThinker achieves state-of-the-art performance across multiple egocentric benchmarks. Both code and data are publicly released.
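To make the data contribution concrete, here is a minimal sketch of what a single EgoRe-5M record could look like, pairing a CoT rationale with dense hand-object grounding. The released schema may differ; every field name and value below is an illustrative assumption, not the published format.

```python
# Hypothetical EgoRe-5M-style record (all field names are assumptions;
# see the project repository for the actual released schema).
record = {
    "video": "ego_clip_000123.mp4",
    "segment": [12.0, 95.5],  # multi-minute span, in seconds
    "question": "Why does the camera wearer open the drawer?",
    # Spatio-temporal chain-of-thought rationale used as SFT supervision
    "cot_rationale": (
        "The right hand reaches the drawer handle at ~14s and retrieves "
        "a whisk at ~18s; batter is being mixed at ~25s, so the intent "
        "is to fetch a utensil for mixing."
    ),
    "answer": "To take out a whisk for mixing the batter.",
    # Dense hand-object grounding: normalized (x1, y1, x2, y2) boxes per timestamp
    "hand_object_grounding": [
        {"t": 14.2, "hand_box": [0.55, 0.60, 0.72, 0.85],
         "object": "drawer handle", "object_box": [0.50, 0.55, 0.66, 0.68]},
        {"t": 18.0, "hand_box": [0.48, 0.52, 0.70, 0.80],
         "object": "whisk", "object_box": [0.46, 0.40, 0.58, 0.62]},
    ],
}
```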

📝 Abstract
Egocentric video reasoning centers on an unobservable agent behind the camera who dynamically shapes the environment, requiring inference of hidden intentions and recognition of fine-grained interactions. This core challenge limits current multimodal large language models (MLLMs), which excel at visible event reasoning but lack embodied, first-person understanding. To bridge this gap, we introduce EgoThinker, a novel framework that endows MLLMs with robust egocentric reasoning capabilities through spatio-temporal chain-of-thought supervision and a two-stage learning curriculum. First, we introduce EgoRe-5M, a large-scale egocentric QA dataset constructed from 13M diverse egocentric video clips. This dataset features multi-minute segments annotated with detailed CoT rationales and dense hand-object grounding. Second, we employ supervised fine-tuning (SFT) on EgoRe-5M to instill reasoning skills, followed by reinforcement fine-tuning (RFT) to further enhance spatio-temporal localization. Experimental results show that EgoThinker outperforms existing methods across multiple egocentric benchmarks, while achieving substantial improvements in fine-grained spatio-temporal localization tasks. Full code and data are released at https://github.com/InternRobotics/EgoThinker.
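The RFT stage targets spatio-temporal localization, which implies a verifiable, annotation-derived reward. The paper's exact reward function is not reproduced on this page, so the sketch below is an assumption: a simple blend of temporal-span IoU and spatial-box IoU, a common choice for grounding-oriented reinforcement fine-tuning. All names (`temporal_iou`, `box_iou`, `localization_reward`) are illustrative.

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """IoU between two [start, end] time spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(pred: tuple[float, float, float, float],
            gt: tuple[float, float, float, float]) -> float:
    """IoU between two (x1, y1, x2, y2) boxes in normalized coordinates."""
    ix = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    iy = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = ix * iy
    union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
             + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    return inter / union if union > 0 else 0.0

def localization_reward(pred_span, gt_span, pred_box, gt_box,
                        w_time: float = 0.5, w_space: float = 0.5) -> float:
    """Scalar reward in [0, 1] mixing temporal and spatial overlap."""
    return (w_time * temporal_iou(pred_span, gt_span)
            + w_space * box_iou(pred_box, gt_box))

# Example: a prediction that partially overlaps the ground truth.
print(localization_reward((2.0, 6.0), (3.0, 7.0),
                          (0.1, 0.1, 0.5, 0.5), (0.2, 0.2, 0.6, 0.6)))
```

A reward of this form is attractive for RFT because it is computed directly from the dense hand-object and temporal annotations, with no learned reward model in the loop.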
Problem

Research questions and friction points this paper is trying to address.

Enabling MLLMs to infer the hidden intentions of the unseen camera wearer
Improving fine-grained spatio-temporal localization in first-person videos
Bridging the gap between visible-event reasoning and embodied, first-person understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces EgoThinker framework for egocentric video reasoning
Uses spatio-temporal chain-of-thought supervision approach
Implements two-stage learning curriculum with SFT and RFT
👥 Authors
Baoqi Pei
Zhejiang University
Computer Vision, Multimodal Learning
Yifei Huang
Shanghai Artificial Intelligence Laboratory, The University of Tokyo
Jilan Xu
Fudan University
Computer Vision, Multimodal, Medical Image Analysis
Yuping He
Nanjing University
Guo Chen
Nanjing University
Fei Wu
Zhejiang University
Yu Qiao
Shanghai Artificial Intelligence Laboratory
Jiangmiao Pang
Shanghai Artificial Intelligence Laboratory