EgoIntent: An Egocentric Step-level Benchmark for Understanding What, Why, and Next

πŸ“… 2026-03-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing methods struggle to achieve step-level human intention understanding in first-person videos, limiting the deployment of intelligent agents and robots in real-time reasoning tasks that require answering β€œwhat is being done,” β€œwhy it is being done,” and β€œwhat comes next.” To address this gap, this work introduces the first benchmark for step-level intention understanding, encompassing 3,014 annotated steps across 15 everyday scenarios. The benchmark evaluates multimodal large language models (MLLMs) along three dimensions: local intent (What), global intent (Why), and next-step planning (Next). By truncating videos just before critical action outcomes occur, the benchmark prevents future-frame leakage and enables a clean assessment of models’ prospective reasoning and planning capabilities. Evaluation of 15 state-of-the-art MLLMs reveals that even the best-performing model achieves an average score of only 33.31 across the three intent dimensions, underscoring the significant challenge this task presents.

πŸ“ Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable video reasoning capabilities across diverse tasks. However, their ability to understand human intent at a fine-grained level in egocentric videos remains largely unexplored. Existing benchmarks focus primarily on episode-level intent reasoning, overlooking the finer granularity of step-level intent understanding. Yet applications such as intelligent assistants, robotic imitation learning, and augmented reality guidance require understanding not only what a person is doing at each step, but also why and what comes next, in order to provide timely and context-aware support. To this end, we introduce EgoIntent, a step-level intent understanding benchmark for egocentric videos. It comprises 3,014 steps spanning 15 diverse indoor and outdoor daily-life scenarios, and evaluates models on three complementary dimensions: local intent (What), global intent (Why), and next-step plan (Next). Crucially, each clip is truncated immediately before the key outcome of the queried step (e.g., contact or grasp) occurs and contains no frames from subsequent steps, preventing future-frame leakage and enabling a clean evaluation of anticipatory step understanding and next-step planning. We evaluate 15 MLLMs, including both state-of-the-art closed-source and open-source models. Even the best-performing model achieves an average score of only 33.31 across the three intent dimensions, underscoring that step-level intent understanding in egocentric videos remains a highly challenging problem that calls for further investigation.
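The leakage-free truncation described in the abstract can be sketched as a simple clipping rule: keep frames of the queried step only up to (and excluding) the frame where the key outcome occurs, and never past the start of the next step. The annotation fields and function below are illustrative assumptions, not the benchmark's actual data format.

```python
# Hypothetical sketch of EgoIntent-style clip truncation. The annotation
# schema (step_start / outcome_frame / next_step_start) is assumed for
# illustration; the paper does not specify its internal format.
from dataclasses import dataclass

@dataclass
class StepAnnotation:
    step_start: int       # first frame of the queried step
    outcome_frame: int    # frame where the key outcome (e.g., grasp) occurs
    next_step_start: int  # first frame of the following step

def truncate_clip(frames: list, ann: StepAnnotation) -> list:
    """Return frames from the step's start up to, but not including,
    the key outcome frame, capped at the next step's start so no
    future frames leak into the evaluated clip."""
    end = min(ann.outcome_frame, ann.next_step_start)
    return frames[ann.step_start:end]

frames = list(range(100))  # stand-in for decoded video frames
ann = StepAnnotation(step_start=10, outcome_frame=40, next_step_start=55)
clip = truncate_clip(frames, ann)
assert max(clip) < ann.outcome_frame  # outcome and later steps are excluded
```

Under this rule, a model queried on the clip must infer the intent (What/Why) and the plan (Next) prospectively, since the outcome itself is never visible.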
Problem

Research questions and friction points this paper is trying to address.

egocentric video
step-level intent
human intent understanding
multimodal reasoning
anticipatory understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

step-level intent understanding
egocentric video
anticipatory reasoning
multimodal large language models
next-step planning