EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) face two key bottlenecks in egocentric video understanding: inferring the latent intentions of the camera wearer and recognizing fine-grained hand-object interactions. To address these challenges, we propose EgoThinker, a novel framework built around EgoRe-5M, a large-scale egocentric video question-answering dataset of 5 million QA pairs. The method combines spatio-temporal chain-of-thought (CoT) supervision with a two-stage learning paradigm, supervised fine-tuning (SFT) followed by reinforcement fine-tuning (RFT), jointly leveraging dense hand-object grounding annotations and chained spatio-temporal rationales. This design substantially improves embodied reasoning and fine-grained spatio-temporal localization. Extensive experiments show that EgoThinker achieves state-of-the-art performance across multiple egocentric benchmarks. Both code and data are publicly released.
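To make the data contribution concrete, here is a minimal sketch of what a single EgoRe-5M record could look like, pairing a CoT rationale with dense hand-object grounding. The released schema may differ; every field name and value below is an illustrative assumption, not the published format.

```python
# Hypothetical EgoRe-5M-style record (all field names are assumptions;
# see the project repository for the actual released schema).
record = {
    "video": "ego_clip_000123.mp4",
    "segment": [12.0, 95.5],  # multi-minute span, in seconds
    "question": "Why does the camera wearer open the drawer?",
    # Spatio-temporal chain-of-thought rationale used as SFT supervision
    "cot_rationale": (
        "The right hand reaches the drawer handle at ~14s and retrieves "
        "a whisk at ~18s; batter is being mixed at ~25s, so the intent "
        "is to fetch a utensil for mixing."
    ),
    "answer": "To take out a whisk for mixing the batter.",
    # Dense hand-object grounding: normalized (x1, y1, x2, y2) boxes per timestamp
    "hand_object_grounding": [
        {"t": 14.2, "hand_box": [0.55, 0.60, 0.72, 0.85],
         "object": "drawer handle", "object_box": [0.50, 0.55, 0.66, 0.68]},
        {"t": 18.0, "hand_box": [0.48, 0.52, 0.70, 0.80],
         "object": "whisk", "object_box": [0.46, 0.40, 0.58, 0.62]},
    ],
}
```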

📝 Abstract
Egocentric video reasoning centers on an unobservable agent behind the camera who dynamically shapes the environment, requiring inference of hidden intentions and recognition of fine-grained interactions. This core challenge limits current multimodal large language models (MLLMs), which excel at visible event reasoning but lack embodied, first-person understanding. To bridge this gap, we introduce EgoThinker, a novel framework that endows MLLMs with robust egocentric reasoning capabilities through spatio-temporal chain-of-thought supervision and a two-stage learning curriculum. First, we introduce EgoRe-5M, a large-scale egocentric QA dataset constructed from 13M diverse egocentric video clips. This dataset features multi-minute segments annotated with detailed CoT rationales and dense hand-object grounding. Second, we employ supervised fine-tuning (SFT) on EgoRe-5M to instill reasoning skills, followed by reinforcement fine-tuning (RFT) to further enhance spatio-temporal localization. Experimental results show that EgoThinker outperforms existing methods across multiple egocentric benchmarks, while achieving substantial improvements in fine-grained spatio-temporal localization tasks. Full code and data are released at https://github.com/InternRobotics/EgoThinker.
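The RFT stage targets spatio-temporal localization, which implies a verifiable, annotation-derived reward. The paper's exact reward function is not reproduced on this page, so the sketch below is an assumption: a simple blend of temporal-span IoU and spatial-box IoU, a common choice for grounding-oriented reinforcement fine-tuning. All names (`temporal_iou`, `box_iou`, `localization_reward`) are illustrative.

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """IoU between two [start, end] time spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(pred: tuple[float, float, float, float],
            gt: tuple[float, float, float, float]) -> float:
    """IoU between two (x1, y1, x2, y2) boxes in normalized coordinates."""
    ix = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    iy = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = ix * iy
    union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
             + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    return inter / union if union > 0 else 0.0

def localization_reward(pred_span, gt_span, pred_box, gt_box,
                        w_time: float = 0.5, w_space: float = 0.5) -> float:
    """Scalar reward in [0, 1] mixing temporal and spatial overlap."""
    return (w_time * temporal_iou(pred_span, gt_span)
            + w_space * box_iou(pred_box, gt_box))

# Example: a prediction that partially overlaps the ground truth.
print(localization_reward((2.0, 6.0), (3.0, 7.0),
                          (0.1, 0.1, 0.5, 0.5), (0.2, 0.2, 0.6, 0.6)))
```

A reward of this form is attractive for RFT because it is computed directly from the dense hand-object and temporal annotations, with no learned reward model in the loop.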
Problem

Research questions and friction points this paper is trying to address.

Enabling MLLMs to infer the hidden intentions of the unseen camera wearer
Improving fine-grained spatio-temporal localization in first-person videos
Bridging the gap between visible-event reasoning and embodied, first-person understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces EgoThinker framework for egocentric video reasoning
Uses spatio-temporal chain-of-thought supervision approach
Implements two-stage learning curriculum with SFT and RFT
👥 Authors
Baoqi Pei
Zhejiang University
Computer Vision, Multimodal Learning
Yifei Huang
Shanghai Artificial Intelligence Laboratory, The University of Tokyo
Jilan Xu
Fudan University
Computer Vision, Multimodal, Medical Image Analysis
Yuping He
Nanjing University
Guo Chen
Nanjing University
Fei Wu
Zhejiang University
Yu Qiao
Shanghai Artificial Intelligence Laboratory
Jiangmiao Pang
Shanghai Artificial Intelligence Laboratory