🤖 AI Summary
Current multimodal large language models (MLLMs) face two key bottlenecks in egocentric video understanding: inferring latent intentions and recognizing fine-grained hand–object interactions. To address these challenges, we propose EgoThinker, a novel framework built around a large-scale egocentric video question-answering dataset, EgoRe-5M, comprising 5 million QA pairs. Our method introduces spatiotemporal chain-of-thought supervision and a two-stage learning paradigm, supervised fine-tuning (SFT) followed by reinforcement fine-tuning (RFT), that jointly leverages dense hand–object grounding annotations and chained spatiotemporal reasoning. This design significantly enhances embodied reasoning and fine-grained spatiotemporal localization. Extensive experiments demonstrate that EgoThinker achieves state-of-the-art performance across multiple egocentric benchmarks. Both code and data are publicly released.
📝 Abstract
Egocentric video reasoning centers on an unobservable agent behind the camera who dynamically shapes the environment, requiring inference of hidden intentions and recognition of fine-grained interactions. This core challenge limits current multimodal large language models (MLLMs), which excel at visible-event reasoning but lack embodied, first-person understanding. To bridge this gap, we introduce EgoThinker, a novel framework that endows MLLMs with robust egocentric reasoning capabilities through spatio-temporal chain-of-thought supervision and a two-stage learning curriculum. First, we introduce EgoRe-5M, a large-scale egocentric QA dataset constructed from 13M diverse egocentric video clips. The dataset features multi-minute segments annotated with detailed chain-of-thought (CoT) rationales and dense hand–object grounding. Second, we employ supervised fine-tuning (SFT) on EgoRe-5M to instill reasoning skills, followed by reinforcement fine-tuning (RFT) to further enhance spatio-temporal localization. Experimental results show that EgoThinker outperforms existing methods across multiple egocentric benchmarks, while achieving substantial improvements in fine-grained spatio-temporal localization tasks. Full code and data are released at https://github.com/InternRobotics/EgoThinker.
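The abstract does not specify the RFT reward, but reinforcement fine-tuning for spatio-temporal localization is commonly driven by a verifiable reward such as temporal IoU between a predicted and a ground-truth segment, optionally combined with answer correctness. The sketch below is a minimal, illustrative example of such a reward function; the function names, the 0.5 weighting, and the composite form are assumptions, not the paper's actual formulation.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def rft_reward(answer_correct, pred_segment, gt_segment, w_loc=0.5):
    """Composite reward mixing answer accuracy with localization quality.
    Illustrative weighting only; not EgoThinker's exact reward."""
    acc = float(answer_correct)
    loc = temporal_iou(pred_segment, gt_segment)
    return (1.0 - w_loc) * acc + w_loc * loc

# Correct answer with a partially overlapping segment:
# overlap 3s, union 5s -> IoU 0.6, reward 0.5*1.0 + 0.5*0.6 = 0.8
reward = rft_reward(True, (2.0, 6.0), (3.0, 7.0))  # → 0.8
```

In an RFT loop, such a scalar reward would score each sampled rollout and drive a policy-gradient update, rewarding rationales whose grounded segments align with annotated hand–object interactions.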