Perceive, Reflect and Understand Long Video: Progressive Multi-Granular Clue Exploration with Interactive Agents

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-form video understanding involves high temporal complexity and sparse task-relevant information, making it difficult for existing large language model (LLM)-based approaches to capture key information comprehensively while reasoning efficiently. This paper proposes CogniGPT, a training-free framework with a dual-agent interactive architecture inspired by human progressive visual cognition: a Multi-Granular Perception Agent (MGPA) and a Verification-Enhanced Reflection Agent (VERA). MGPA extracts multi-scale spatiotemporal clues, while VERA validates the perceived clues and iteratively refines the perception strategy, enabling dynamic error correction and hallucination suppression. This closed-loop reasoning system achieves state-of-the-art performance among training-free methods on four standard benchmarks (EgoSchema, Video-MME, NExT-QA, and MovieChat); on EgoSchema it uses only 11.2 frames per video on average while matching Gemini 1.5-Pro's accuracy, offering superior efficiency and robustness.

📝 Abstract
Long videos, characterized by temporal complexity and sparse task-relevant information, pose significant reasoning challenges for AI systems. Although various Large Language Model (LLM)-based approaches have advanced long video understanding, they still struggle to achieve both completeness and efficiency in capturing task-critical information. Inspired by human progressive visual cognition, we propose CogniGPT, a framework that leverages an interactive loop between Multi-Granular Perception Agent (MGPA) and Verification-Enhanced Reflection Agent (VERA) for efficient and reliable long video understanding. Specifically, MGPA mimics human visual divergent and focused attention to capture task-related information, while VERA verifies perceived key clues to mitigate hallucination and optimize subsequent perception strategies. Through this interactive process, CogniGPT explores a minimal set of informative and reliable task-related clues. Extensive experiments on EgoSchema, Video-MME, NExT-QA, and MovieChat datasets demonstrate CogniGPT's superiority in both accuracy and efficiency. Notably, on EgoSchema, it surpasses existing training-free methods using only 11.2 frames and achieves performance comparable to Gemini 1.5-Pro.
Problem

Research questions and friction points this paper is trying to address.

Addresses inefficient task-critical information capture in long videos
Mitigates hallucination through verification-enhanced reflection mechanisms
Achieves accurate video understanding with minimal frame sampling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interactive loop between perception and reflection agents
Multi-granular perception mimicking human visual attention
Verification-enhanced reflection to mitigate hallucination
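The interactive perceive-verify loop described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the agents are stubbed out as plain functions, and every name and signature here (`mgpa_perceive`, `vera_verify`, `explore`) is an assumption made for illustration, reusing only the MGPA/VERA terminology from the abstract.

```python
def mgpa_perceive(frames, granularity):
    """Stub Multi-Granular Perception Agent: sample every `granularity`-th
    frame as a candidate clue (the real agent would query a vision-language
    model for multi-scale spatiotemporal clues)."""
    return frames[::granularity]


def vera_verify(clues, is_relevant):
    """Stub Verification-Enhanced Reflection Agent: keep only clues that
    pass a relevance check, filtering hallucinated or off-task evidence."""
    return [c for c in clues if is_relevant(c)]


def explore(frames, is_relevant, max_rounds=3):
    """Interactive loop: perceive coarsely first (divergent attention), then
    refine the sampling granularity (focused attention) until verified clues
    are found or the round budget is exhausted."""
    granularity = max(len(frames) // 4, 1)
    for _ in range(max_rounds):
        clues = mgpa_perceive(frames, granularity)
        verified = vera_verify(clues, is_relevant)
        if verified:
            return verified  # minimal set of reliable task-related clues
        granularity = max(granularity // 2, 1)  # narrow the attention scope
    return []


frames = list(range(32))  # toy "video": frame indices stand in for frames
print(explore(frames, lambda f: f % 7 == 0))
```

The key design point mirrored here is that verification feeds back into perception: a failed round does not just retry, it changes the sampling strategy, which is how the framework keeps the inspected frame count small.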
Jiahua Li
School of Electronic Engineering, Xidian University, Xi’an, China
Kun Wei
School of Computer Science, Northwestern Polytechnical University
Zhe Xu
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China
Zibo Su
School of Electronic Engineering, Xidian University, Xi’an, China
Xu Yang
School of Electronic Engineering, Xidian University, Xi’an, China
Cheng Deng
University of Edinburgh