🤖 AI Summary
Multimodal large language models (MLLMs) face fundamental limitations in long-video understanding due to constrained token capacity and weak temporal modeling, which hinder global context capture and reasoning over complex event relationships. To address this, we propose an agent-based reasoning framework featuring: (i) a structured schema coupled with narrative episodic memory that explicitly encodes causal and temporal dependencies among events; and (ii) a multi-stage Perception–Action–Reflection loop that enables dynamic context updating and retrieval-augmented reasoning. Our framework overcomes the long-horizon dependency bottleneck, achieving 73.4% accuracy on the Video-MME Long split, an improvement of up to 23.5 points over a strong MLLM baseline, and setting a new state of the art for 7B-scale MLLMs. The overall average score reaches 71.9%, empirically validating the efficacy of global contextual awareness and deep event-level reasoning.
📝 Abstract
Long-video understanding remains a significant challenge for Multimodal Large Language Models (MLLMs) due to inherent token limitations and the difficulty of capturing long-term temporal dependencies. Existing methods often fail to capture the global context and complex event relationships necessary for deep video reasoning. To address this, we introduce GCAgent, a novel Global-Context-Aware Agent framework that achieves comprehensive long-video understanding. Our core innovation is the Schematic and Narrative Episodic Memory, which structurally models events and their causal and temporal relations into a concise, organized context, fundamentally resolving the long-term dependency problem. Operating in a multi-stage Perception-Action-Reflection cycle, GCAgent utilizes a Memory Manager to retrieve relevant episodic context for robust, context-aware inference. Extensive experiments confirm that GCAgent significantly enhances long-video understanding, achieving up to a 23.5% accuracy improvement on the Video-MME Long split over a strong MLLM baseline. Furthermore, our framework establishes state-of-the-art performance among comparable 7B-scale MLLMs, achieving 73.4% accuracy on the Long split and the highest overall average (71.9%) on the Video-MME benchmark, validating our agent-based reasoning paradigm and structured memory for cognitively inspired long-video understanding.
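To make the described architecture concrete, the following is a minimal sketch of a Perception-Action-Reflection loop backed by an episodic memory store. All names (`Event`, `MemoryManager`, `answer_question`) and the keyword-overlap retrieval are illustrative assumptions, not the paper's actual implementation; in GCAgent the perception and reflection steps would be MLLM calls, stubbed out here as plain callables.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical structured event entry in the episodic memory."""
    t_start: float
    t_end: float
    description: str
    causes: list = field(default_factory=list)  # indices of causally prior events


class MemoryManager:
    """Illustrative memory manager: stores events, retrieves the top-k
    entries by naive keyword overlap with the query (a stand-in for
    whatever retrieval the real framework uses)."""

    def __init__(self):
        self.events: list[Event] = []

    def add(self, event: Event) -> None:
        self.events.append(event)

    def retrieve(self, query: str, k: int = 3) -> list[Event]:
        q = set(query.lower().split())
        scored = sorted(
            self.events,
            key=lambda e: -len(q & set(e.description.lower().split())),
        )
        return scored[:k]


def answer_question(question, clips, perceive, reflect, max_rounds=3):
    """Sketch of the Perception-Action-Reflection cycle.

    perceive(clip) -> Event          # Perception: encode a clip into memory
    reflect(q, ctx, prev) -> (answer, confident)  # Action + Reflection
    """
    memory = MemoryManager()
    for clip in clips:                       # Perception stage
        memory.add(perceive(clip))
    answer = None
    for _ in range(max_rounds):              # Action: retrieve context
        context = memory.retrieve(question)
        answer, confident = reflect(question, context, answer)
        if confident:                        # Reflection: stop when satisfied
            break
    return answer
```

With stub `perceive`/`reflect` functions, the loop retrieves the event whose description best matches the question and returns an answer grounded in that episodic context; the real system would replace both stubs with model-driven components.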