Perceive, Reflect and Understand Long Video: Progressive Multi-Granular Clue Exploration with Interactive Agents

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-form video understanding involves high temporal complexity and sparse task-relevant information, making it difficult for existing large language model (LLM)-based approaches to capture key information comprehensively while reasoning efficiently. This paper proposes CogniGPT, a training-free framework with a dual-agent interactive architecture inspired by human progressive visual cognition: a Multi-Granular Perception Agent (MGPA) and a Verification-Enhanced Reflection Agent (VERA). MGPA extracts multi-scale spatiotemporal clues, while VERA validates the perceived clues and iteratively refines the perception strategy, enabling dynamic error correction and hallucination suppression. This closed-loop reasoning system achieves state-of-the-art performance among training-free methods on four standard benchmarks (EgoSchema, Video-MME, NExT-QA, and MovieChat); on EgoSchema it uses only 11.2 frames per video on average while matching Gemini 1.5-Pro's accuracy, offering superior efficiency and robustness.

📝 Abstract
Long videos, characterized by temporal complexity and sparse task-relevant information, pose significant reasoning challenges for AI systems. Although various Large Language Model (LLM)-based approaches have advanced long video understanding, they still struggle to achieve both completeness and efficiency in capturing task-critical information. Inspired by human progressive visual cognition, we propose CogniGPT, a framework that leverages an interactive loop between Multi-Granular Perception Agent (MGPA) and Verification-Enhanced Reflection Agent (VERA) for efficient and reliable long video understanding. Specifically, MGPA mimics human visual divergent and focused attention to capture task-related information, while VERA verifies perceived key clues to mitigate hallucination and optimize subsequent perception strategies. Through this interactive process, CogniGPT explores a minimal set of informative and reliable task-related clues. Extensive experiments on EgoSchema, Video-MME, NExT-QA, and MovieChat datasets demonstrate CogniGPT's superiority in both accuracy and efficiency. Notably, on EgoSchema, it surpasses existing training-free methods using only 11.2 frames and achieves performance comparable to Gemini 1.5-Pro.
Problem

Research questions and friction points this paper is trying to address.

Addresses inefficient task-critical information capture in long videos
Mitigates hallucination through verification-enhanced reflection mechanisms
Achieves accurate video understanding with minimal frame sampling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interactive loop between perception and reflection agents
Multi-granular perception mimicking human visual attention
Verification-enhanced reflection to mitigate hallucination
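The interactive perceive-verify loop described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the agents are stubbed out as plain functions, and every name and signature here (`mgpa_perceive`, `vera_verify`, `explore`) is an assumption made for illustration, reusing only the MGPA/VERA terminology from the abstract.

```python
def mgpa_perceive(frames, granularity):
    """Stub Multi-Granular Perception Agent: sample every `granularity`-th
    frame as a candidate clue (the real agent would query a vision-language
    model for multi-scale spatiotemporal clues)."""
    return frames[::granularity]


def vera_verify(clues, is_relevant):
    """Stub Verification-Enhanced Reflection Agent: keep only clues that
    pass a relevance check, filtering hallucinated or off-task evidence."""
    return [c for c in clues if is_relevant(c)]


def explore(frames, is_relevant, max_rounds=3):
    """Interactive loop: perceive coarsely first (divergent attention), then
    refine the sampling granularity (focused attention) until verified clues
    are found or the round budget is exhausted."""
    granularity = max(len(frames) // 4, 1)
    for _ in range(max_rounds):
        clues = mgpa_perceive(frames, granularity)
        verified = vera_verify(clues, is_relevant)
        if verified:
            return verified  # minimal set of reliable task-related clues
        granularity = max(granularity // 2, 1)  # narrow the attention scope
    return []


frames = list(range(32))  # toy "video": frame indices stand in for frames
print(explore(frames, lambda f: f % 7 == 0))
```

The key design point mirrored here is that verification feeds back into perception: a failed round does not just retry, it changes the sampling strategy, which is how the framework keeps the inspected frame count small.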
Jiahua Li
School of Electronic Engineering, Xidian University, Xi’an, China
Kun Wei
School of Computer Science, Northwestern Polytechnical University
Zhe Xu
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China
Zibo Su
School of Electronic Engineering, Xidian University, Xi’an, China
Xu Yang
School of Electronic Engineering, Xidian University, Xi’an, China
Cheng Deng
University of Edinburgh