AI Summary
Existing methods struggle to achieve step-level human intention understanding in first-person videos, limiting the deployment of intelligent agents and robots in real-time reasoning tasks that require answering "what is being done," "why it is being done," and "what comes next." To address this gap, this work introduces the first benchmark for step-level intention understanding, encompassing 3,014 annotated steps across 15 everyday scenarios. The benchmark evaluates multimodal large language models (MLLMs) along three dimensions: local intent (What), global intent (Why), and next-step planning (Next). By truncating videos just before critical action outcomes occur, the benchmark prevents future-frame leakage and enables a clean assessment of models' prospective reasoning and planning capabilities. Evaluation of 15 state-of-the-art MLLMs reveals that even the best-performing model achieves only 33.31% average accuracy, underscoring the significant challenge this task presents.
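To make the truncation idea concrete, here is a minimal sketch of how a clip for a queried step could be cut so it ends just before the step's key outcome and never includes frames from later steps. The schema and helper (`StepAnnotation`, `truncation_window`, the field names, and the margin) are hypothetical illustrations, not the paper's released preprocessing code.

```python
from dataclasses import dataclass

@dataclass
class StepAnnotation:
    # Hypothetical schema: all times are in seconds within the full egocentric video.
    step_start: float        # when the queried step begins
    outcome_time: float      # when the key outcome (e.g., contact or grasp) occurs
    next_step_start: float   # when the following step begins

def truncation_window(ann: StepAnnotation, margin: float = 0.25) -> tuple[float, float]:
    """Return a (start, end) window that stops just before the key outcome.

    The end is also clamped to the start of the next step, so the clip contains
    no future frames when models are asked What / Why / Next.
    """
    end = min(ann.outcome_time - margin, ann.next_step_start)
    return ann.step_start, max(ann.step_start, end)

# Example: grasp occurs at t=12.0s, the next step starts at t=13.0s.
print(truncation_window(StepAnnotation(8.0, 12.0, 13.0)))  # (8.0, 11.75)
```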
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable video reasoning capabilities across diverse tasks. However, their ability to understand human intent at a fine-grained level in egocentric videos remains largely unexplored. Existing benchmarks focus primarily on episode-level intent reasoning, overlooking the finer granularity of step-level intent understanding. Yet applications such as intelligent assistants, robotic imitation learning, and augmented reality guidance require understanding not only what a person is doing at each step, but also why and what comes next, in order to provide timely and context-aware support. To this end, we introduce EgoIntent, a step-level intent understanding benchmark for egocentric videos. It comprises 3,014 steps spanning 15 diverse indoor and outdoor daily-life scenarios, and evaluates models on three complementary dimensions: local intent (What), global intent (Why), and next-step plan (Next). Crucially, each clip is truncated immediately before the key outcome of the queried step (e.g., contact or grasp) occurs and contains no frames from subsequent steps, preventing future-frame leakage and enabling a clean evaluation of anticipatory step understanding and next-step planning. We evaluate 15 MLLMs, including both state-of-the-art closed-source and open-source models. Even the best-performing model achieves an average score of only 33.31 across the three intent dimensions, underscoring that step-level intent understanding in egocentric videos remains a highly challenging problem that calls for further investigation.
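For intuition only, the sketch below shows one way the reported average could be computed, assuming each benchmark item is tagged with one of the three dimensions and the headline number is an unweighted mean of the per-dimension accuracies. The function name and record format are illustrative assumptions; the paper defines the actual scoring protocol.

```python
from collections import defaultdict

def score_by_dimension(records):
    """records: iterable of (dimension, is_correct) pairs,
    where dimension is one of "What", "Why", "Next".
    Returns per-dimension accuracy (%) and their unweighted mean."""
    correct, total = defaultdict(int), defaultdict(int)
    for dim, ok in records:
        total[dim] += 1
        correct[dim] += int(ok)
    per_dim = {d: 100.0 * correct[d] / total[d] for d in total}
    average = sum(per_dim.values()) / len(per_dim)
    return per_dim, average

# Toy example with made-up predictions; real scores come from the 3,014-step benchmark.
per_dim, avg = score_by_dimension([
    ("What", True), ("What", True),
    ("Why", True), ("Why", False),
    ("Next", False), ("Next", False),
])
print(per_dim, avg)  # {'What': 100.0, 'Why': 50.0, 'Next': 0.0} 50.0
```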