🤖 AI Summary
Long-term temporal understanding of first-person videos spanning days to weeks remains a significant challenge due to extreme temporal sparsity and semantic complexity.
Method: We propose Chain-of-Tool-Thought (CoTT), a structured reasoning process orchestrated by a multimodal agent trained via reinforcement learning (RL). The agent decomposes a complex question into modular steps and invokes one specialized tool per step to answer each sub-question, mimicking human stepwise problem solving. Training proceeds in two stages: supervised fine-tuning (SFT) of a pretrained language model on CoTT data, followed by RL.
Contribution/Results: We introduce Ego-R1 Bench, a newly curated week-long first-person video question-answering benchmark with human-verified QA pairs, alongside a dedicated training dataset, Ego-R1 Data (Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL). Experiments demonstrate that CoTT extends effective temporal coverage from a few hours to a week and substantially outperforms existing baselines on week-level QA tasks, supporting the use of tool-augmented long-horizon reasoning for egocentric video understanding.
📝 Abstract
We introduce Ego-R1, a novel framework for reasoning over ultra-long (i.e., spanning days to weeks) egocentric videos, which leverages a structured Chain-of-Tool-Thought (CoTT) process orchestrated by an Ego-R1 Agent trained via reinforcement learning (RL). Inspired by human problem-solving strategies, CoTT decomposes complex reasoning into modular steps, with the RL agent invoking one specific tool per step to iteratively and collaboratively answer sub-questions covering tasks such as temporal retrieval and multi-modal understanding. We design a two-stage training paradigm: supervised fine-tuning (SFT) of a pretrained language model on CoTT data, followed by RL, which enables the agent to dynamically propose tools step by step for long-range reasoning. To facilitate training, we construct a dataset called Ego-R1 Data, which consists of Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL. Furthermore, our Ego-R1 Agent is evaluated on a newly curated week-long video QA benchmark, Ego-R1 Bench, which contains human-verified QA pairs from hybrid sources. Extensive results demonstrate that the dynamic, tool-augmented chain-of-thought reasoning of our Ego-R1 Agent can effectively tackle the unique challenges of understanding ultra-long egocentric videos, significantly extending the time coverage from a few hours to a week.
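The one-tool-per-step loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tool names (`temporal_retrieval`, `visual_qa`), their canned outputs, the `Step` record, and the `scripted_policy` that stands in for the trained Ego-R1 Agent are all hypothetical, chosen only to show the control flow of a chain of tool calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One link in the chain: a thought, the tool called, and what it returned."""
    thought: str
    tool: str
    query: str
    observation: str


# Hypothetical tools with canned outputs (assumptions, not the paper's tools).
def temporal_retrieval(query: str) -> str:
    # Pretend to locate the relevant time span in a week-long recording.
    return "DAY3 14:05-14:10"


def visual_qa(query: str) -> str:
    # Pretend to inspect the clip at the given time span.
    return f"keys on the kitchen table during {query}"


TOOLS: Dict[str, Callable[[str], str]] = {
    "temporal_retrieval": temporal_retrieval,
    "visual_qa": visual_qa,
}


def scripted_policy(question: str, history: List[Step]) -> Tuple[str, str, str]:
    """Stand-in for the trained agent: returns (thought, tool, query)."""
    if not history:
        return ("Find when this happened.", "temporal_retrieval", question)
    if len(history) == 1:
        return ("Inspect that clip.", "visual_qa", history[-1].observation)
    # The special tool name "answer" ends the chain.
    return ("Enough evidence.", "answer", history[-1].observation)


def run_cott(
    question: str,
    policy: Callable[[str, List[Step]], Tuple[str, str, str]],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 8,
) -> Tuple[Optional[str], List[Step]]:
    """Call exactly one tool per step until the policy answers or steps run out."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, tool, query = policy(question, history)
        if tool == "answer":
            return query, history
        observation = tools[tool](query)
        history.append(Step(thought, tool, query, observation))
    return None, history


answer, steps = run_cott("Where did I leave my keys?", scripted_policy, TOOLS)
```

In this sketch each observation is appended to the history that the policy sees on the next step, which is what lets later tool calls depend on earlier results. In the setting the abstract describes, the scripted policy would be replaced by the language model trained with SFT and RL.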