Agentic Very Long Video Understanding

πŸ“… 2026-01-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing approaches struggle to achieve contextually coherent multi-hop reasoning over first-person video streams spanning days or weeks. To address this challenge, this work proposes the EGAgent framework, which centers on an entity-centric scene graph to structurally model people, locations, objects, and their temporal relationships. By integrating a planning agent, structured search, and cross-modal (vision–audio) retrieval, EGAgent enables temporally consistent multi-hop inference in ultra-long videos. The method achieves state-of-the-art performance with 57.5% accuracy on EgoLifeQA and attains 74.1% on Video-MME (Long), substantially advancing complex question answering capabilities for long-form egocentric video understanding.
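As a rough illustration of the entity-centric scene graph the summary describes, the minimal Python sketch below models people, places, and objects as nodes with timestamped relations and supports a simple temporal lookup. The class names (Entity, Relation, EntitySceneGraph) and fields are assumptions for illustration, not the paper's actual data model.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Entity:
    """A person, place, or object observed in the egocentric video stream."""
    id: str
    kind: str  # "person" | "place" | "object"

@dataclass
class Relation:
    """A timestamped relation between two entities, e.g. alice -- holds --> mug."""
    subject: Entity
    predicate: str
    obj: Entity
    start_s: float  # seconds from the start of the recording
    end_s: float

@dataclass
class EntitySceneGraph:
    relations: list[Relation] = field(default_factory=list)

    def add(self, relation: Relation) -> None:
        self.relations.append(relation)

    def neighbors(self, entity_id: str, until_s: float | None = None) -> list[Relation]:
        """Relations touching an entity, optionally restricted to those starting before a time."""
        return [
            r for r in self.relations
            if entity_id in (r.subject.id, r.obj.id)
            and (until_s is None or r.start_s <= until_s)
        ]

# Toy usage: which relations involve the mug within the first hour of recording?
alice = Entity("alice", "person")
mug = Entity("mug", "object")
kitchen = Entity("kitchen", "place")
g = EntitySceneGraph()
g.add(Relation(alice, "holds", mug, 120.0, 150.0))
g.add(Relation(mug, "located_in", kitchen, 0.0, 86400.0))
print([r.predicate for r in g.neighbors("mug", until_s=3600.0)])  # ['holds', 'located_in']
```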

πŸ“ Abstract
The advent of always-on personal AI assistants, enabled by all-day wearable devices such as smart glasses, demands a new level of contextual understanding, one that goes beyond short, isolated events to encompass the continuous, longitudinal stream of egocentric video. Achieving this vision requires advances in long-horizon video understanding, where systems must interpret and recall visual and audio information spanning days or even weeks. Existing methods, including large language models and retrieval-augmented generation, are constrained by limited context windows and lack the ability to perform compositional, multi-hop reasoning over very long video streams. In this work, we address these challenges through EGAgent, an enhanced agentic framework centered on entity scene graphs, which represent people, places, objects, and their relationships over time. Our system equips a planning agent with tools for structured search and reasoning over these graphs, as well as hybrid visual and audio search capabilities, enabling detailed, cross-modal, and temporally coherent reasoning. Experiments on the EgoLifeQA and Video-MME (Long) datasets show that our method achieves state-of-the-art performance on EgoLifeQA (57.5%) and competitive performance on Video-MME (Long) (74.1%) for complex longitudinal video understanding tasks.
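The abstract describes a planning agent that composes structured search over the entity scene graph with hybrid visual and audio retrieval to answer multi-hop questions. Below is a minimal sketch of such a tool-composing loop; the tool names, the fixed plan, and the stub tool implementations are illustrative assumptions, since the paper's actual agent interfaces are not given here.

```python
from typing import Callable

# Illustrative tool type: each tool maps a textual sub-query to retrieved evidence.
Tool = Callable[[str], str]

def answer_multihop(question: str, tools: dict[str, Tool], plan: list[tuple[str, str]]) -> str:
    """Execute a plan of (tool_name, sub_query) hops and accumulate evidence.

    In the described framework a planning agent chooses the hops; here the plan
    is passed in explicitly to keep the sketch self-contained and runnable.
    """
    evidence: list[str] = []
    for tool_name, sub_query in plan:
        # Each hop can condition on evidence gathered by earlier hops (multi-hop reasoning).
        context = " | ".join(evidence)
        evidence.append(tools[tool_name](f"{sub_query} [context: {context}]"))
    return f"Q: {question} -> evidence: {evidence}"

# Stub tools standing in for structured graph search and hybrid visual/audio retrieval.
tools: dict[str, Tool] = {
    "graph_search": lambda q: f"graph hit for '{q}'",
    "visual_search": lambda q: f"frames matching '{q}'",
    "audio_search": lambda q: f"audio segment matching '{q}'",
}

plan = [
    ("graph_search", "who borrowed the charger last Tuesday"),
    ("visual_search", "charger handoff event"),
]
print(answer_multihop("Who has my charger?", tools, plan))
```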
Problem

Research questions and friction points this paper is trying to address.

long-horizon video understanding
egocentric video
compositional reasoning
multi-hop reasoning
contextual understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic Video Understanding
Entity Scene Graphs
Long-horizon Reasoning
Egocentric Video
Multimodal Retrieval
πŸ”Ž Similar Papers