Agentic Very Long Video Understanding

πŸ“… 2026-01-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing approaches struggle to achieve contextually coherent multi-hop reasoning over first-person video streams spanning days or weeks. To address this challenge, this work proposes the EGAgent framework, which centers on an entity-centric scene graph to structurally model people, locations, objects, and their temporal relationships. By integrating a planning agent, structured search, and cross-modal (vision–audio) retrieval, EGAgent enables temporally consistent multi-hop inference in ultra-long videos. The method achieves state-of-the-art performance with 57.5% accuracy on EgoLifeQA and attains 74.1% on Video-MME (Long), substantially advancing complex question answering capabilities for long-form egocentric video understanding.
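As a rough illustration of the entity-centric scene graph the summary describes, the minimal Python sketch below models people, places, and objects as nodes with timestamped relations and supports a simple temporal lookup. The class names (Entity, Relation, EntitySceneGraph) and fields are assumptions for illustration, not the paper's actual data model.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Entity:
    """A person, place, or object observed in the egocentric video stream."""
    id: str
    kind: str  # "person" | "place" | "object"

@dataclass
class Relation:
    """A timestamped relation between two entities, e.g. alice -- holds --> mug."""
    subject: Entity
    predicate: str
    obj: Entity
    start_s: float  # seconds from the start of the recording
    end_s: float

@dataclass
class EntitySceneGraph:
    relations: list[Relation] = field(default_factory=list)

    def add(self, relation: Relation) -> None:
        self.relations.append(relation)

    def neighbors(self, entity_id: str, until_s: float | None = None) -> list[Relation]:
        """Relations touching an entity, optionally restricted to those starting before a time."""
        return [
            r for r in self.relations
            if entity_id in (r.subject.id, r.obj.id)
            and (until_s is None or r.start_s <= until_s)
        ]

# Toy usage: which relations involve the mug within the first hour of recording?
alice = Entity("alice", "person")
mug = Entity("mug", "object")
kitchen = Entity("kitchen", "place")
g = EntitySceneGraph()
g.add(Relation(alice, "holds", mug, 120.0, 150.0))
g.add(Relation(mug, "located_in", kitchen, 0.0, 86400.0))
print([r.predicate for r in g.neighbors("mug", until_s=3600.0)])  # ['holds', 'located_in']
```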

πŸ“ Abstract
The advent of always-on personal AI assistants, enabled by all-day wearable devices such as smart glasses, demands a new level of contextual understanding, one that goes beyond short, isolated events to encompass the continuous, longitudinal stream of egocentric video. Achieving this vision requires advances in long-horizon video understanding, where systems must interpret and recall visual and audio information spanning days or even weeks. Existing methods, including large language models and retrieval-augmented generation, are constrained by limited context windows and lack the ability to perform compositional, multi-hop reasoning over very long video streams. In this work, we address these challenges through EGAgent, an enhanced agentic framework centered on entity scene graphs, which represent people, places, objects, and their relationships over time. Our system equips a planning agent with tools for structured search and reasoning over these graphs, as well as hybrid visual and audio search capabilities, enabling detailed, cross-modal, and temporally coherent reasoning. Experiments on the EgoLifeQA and Video-MME (Long) datasets show that our method achieves state-of-the-art performance on EgoLifeQA (57.5%) and competitive performance on Video-MME (Long) (74.1%) for complex longitudinal video understanding tasks.
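The abstract describes a planning agent that composes structured search over the entity scene graph with hybrid visual and audio retrieval to answer multi-hop questions. Below is a minimal sketch of such a tool-composing loop; the tool names, the fixed plan, and the stub tool implementations are illustrative assumptions, since the paper's actual agent interfaces are not given here.

```python
from typing import Callable

# Illustrative tool type: each tool maps a textual sub-query to retrieved evidence.
Tool = Callable[[str], str]

def answer_multihop(question: str, tools: dict[str, Tool], plan: list[tuple[str, str]]) -> str:
    """Execute a plan of (tool_name, sub_query) hops and accumulate evidence.

    In the described framework a planning agent chooses the hops; here the plan
    is passed in explicitly to keep the sketch self-contained and runnable.
    """
    evidence: list[str] = []
    for tool_name, sub_query in plan:
        # Each hop can condition on evidence gathered by earlier hops (multi-hop reasoning).
        context = " | ".join(evidence)
        evidence.append(tools[tool_name](f"{sub_query} [context: {context}]"))
    return f"Q: {question} -> evidence: {evidence}"

# Stub tools standing in for structured graph search and hybrid visual/audio retrieval.
tools: dict[str, Tool] = {
    "graph_search": lambda q: f"graph hit for '{q}'",
    "visual_search": lambda q: f"frames matching '{q}'",
    "audio_search": lambda q: f"audio segment matching '{q}'",
}

plan = [
    ("graph_search", "who borrowed the charger last Tuesday"),
    ("visual_search", "charger handoff event"),
]
print(answer_multihop("Who has my charger?", tools, plan))
```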
Problem

Research questions and friction points this paper is trying to address.

long-horizon video understanding
egocentric video
compositional reasoning
multi-hop reasoning
contextual understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic Video Understanding
Entity Scene Graphs
Long-horizon Reasoning
Egocentric Video
Multimodal Retrieval
πŸ”Ž Similar Papers