Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning

📅 2025-06-16

🤖 AI Summary

Problem: Long-term temporal understanding of first-person videos spanning days to weeks remains a significant challenge due to extreme temporal sparsity and semantic complexity. Method: We propose Chain-of-Tool-Thought (CoTT), a structured reasoning process in which a multimodal agent trained with reinforcement learning (RL) dynamically orchestrates specialized tools, one per step, to decompose and solve sub-questions, mimicking human stepwise reasoning through modular, composable tool chains. Training proceeds in two stages: supervised fine-tuning (SFT) of a pretrained language model on CoTT data, followed by RL. Contribution/Results: We introduce Ego-R1 Bench, a newly curated week-long first-person video question-answering benchmark with human-verified QA pairs, alongside a dedicated training dataset, Ego-R1 Data. Experiments indicate that CoTT extends effective temporal coverage from a few hours to a week and reportedly outperforms existing baselines on week-level QA tasks, supporting the efficacy of tool-augmented long-horizon reasoning for egocentric video understanding.

📝 Abstract

We introduce Ego-R1, a novel framework for reasoning over ultra-long (i.e., spanning days to weeks) egocentric videos, which leverages a structured Chain-of-Tool-Thought (CoTT) process orchestrated by an Ego-R1 Agent trained via reinforcement learning (RL). Inspired by human problem-solving strategies, CoTT decomposes complex reasoning into modular steps, with the RL agent invoking specific tools, one per step, to iteratively and collaboratively answer sub-questions involving tasks such as temporal retrieval and multi-modal understanding. We design a two-stage training paradigm: supervised fine-tuning (SFT) of a pretrained language model using CoTT data, followed by RL that enables our agent to dynamically propose step-by-step tools for long-range reasoning. To facilitate training, we construct a dataset called Ego-R1 Data, which consists of Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL. Furthermore, our Ego-R1 Agent is evaluated on a newly curated week-long video QA benchmark, Ego-R1 Bench, which contains human-verified QA pairs from hybrid sources. Extensive results demonstrate that the dynamic, tool-augmented chain-of-thought reasoning of our Ego-R1 Agent can effectively tackle the unique challenges of understanding ultra-long egocentric videos, significantly extending the time coverage from a few hours to a week.
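
The one-tool-per-step loop described in the abstract can be illustrated with a minimal Python sketch. Everything named here is hypothetical and chosen only for illustration: the tool names (`temporal_retrieval`, `clip_understanding`), their canned outputs, the `Step` record, and `scripted_policy`, which stands in for the trained Ego-R1 Agent. It is not the authors' implementation; it only shows the control flow of choosing a tool, observing its result, and repeating until an answer is produced.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tool type: takes a sub-question, returns a text observation.
Tool = Callable[[str], str]


@dataclass
class Step:
    """One reasoning step: a thought, the single tool invoked, and its result."""
    thought: str
    tool: str
    query: str
    observation: str


# Stub tools with canned outputs (assumptions; real tools would query video data).
def temporal_retrieval(query: str) -> str:
    return "Day 3, 14:00-14:10: wearer is in the kitchen"


def clip_understanding(query: str) -> str:
    return "The wearer is holding a red mug"


TOOLS: Dict[str, Tool] = {
    "temporal_retrieval": temporal_retrieval,
    "clip_understanding": clip_understanding,
}


def scripted_policy(question: str, history: List[Step]) -> Tuple[str, str, str]:
    """Stand-in for the trained agent.

    Returns (thought, action, argument). The action is either a tool name
    or the special value "answer", in which case the argument is the answer.
    """
    if len(history) == 0:
        return ("Locate when the event happened.", "temporal_retrieval", question)
    if len(history) == 1:
        return ("Inspect the retrieved clip.", "clip_understanding",
                history[-1].observation)
    return ("Enough evidence gathered.", "answer", history[-1].observation)


def run_cott(
    question: str,
    policy: Callable[[str, List[Step]], Tuple[str, str, str]],
    tools: Dict[str, Tool],
    max_steps: int = 8,
) -> Tuple[Optional[str], List[Step]]:
    """Run the loop: one tool call per step until the policy answers."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, action, arg = policy(question, history)
        if action == "answer":
            return arg, history
        observation = tools[action](arg)  # exactly one tool per step
        history.append(Step(thought, action, arg, observation))
    return None, history  # step budget exhausted without an answer


if __name__ == "__main__":
    answer, trace = run_cott("What was the wearer holding on Day 3?",
                             scripted_policy, TOOLS)
    for i, step in enumerate(trace, start=1):
        print(f"step {i}: {step.tool} -> {step.observation}")
    print("answer:", answer)
```

In this sketch the policy sees the full history of earlier steps, so each tool choice can depend on previous observations; the `max_steps` cap is an added safeguard for the example, not something stated in the source.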

Problem

Research questions and friction points this paper is trying to address.

Reasoning over ultra-long egocentric videos using modular tools

Dynamic tool selection for temporal retrieval and multimodal understanding

Extending video understanding coverage from hours to weeks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Tool-Thought for modular reasoning

Reinforcement learning for dynamic tool selection

Two-stage training with SFT and RL
