Video Detective: Seek Critical Clues Recurrently to Answer Question from Long Videos

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-video question answering (LVQA) suffers from memory explosion and information redundancy due to excessive contextual input. To address this, we propose a query-aware recurrent memory mechanism that iteratively processes video sub-segments, aggregates historical memory tokens, performs query-guided visual token compression, and recursively updates contextual representations, enabling precise localization and efficient reuse of critical evidence. We introduce dedicated memory tokens and establish GLVC (Grounding Long Video Clues), a long-video benchmark for grounding clues scattered throughout entire videos. Our method processes up to 100K visual tokens (3,600 frames) under a 32K-context limit, requiring only 2 minutes and 37 GB of GPU memory. It significantly improves retrieval accuracy of key visual clues across multiple LVQA benchmarks, achieving a favorable trade-off among efficiency, capacity, and accuracy.

📝 Abstract
Long Video Question-Answering (LVQA) presents a significant challenge for Multi-modal Large Language Models (MLLMs) due to the immense context and overloaded information, which can also lead to prohibitive memory consumption. While existing methods attempt to address these issues by reducing visual tokens or extending the model's context length, they may miss useful information or incur considerable computation. In fact, only a small amount of crucial information is required to answer a given question. We therefore propose an efficient question-aware memory mechanism that enables MLLMs to recurrently seek these critical clues. Our approach, named VideoDetective, simplifies the task by iteratively processing video sub-segments. For each sub-segment, a question-aware compression strategy introduces a few special memory tokens to achieve purposeful compression, allowing the model to seek critical clues effectively while reducing visual tokens. Since the history context can have a significant impact, we recurrently aggregate and store these memory tokens to update the history context, which is reused for subsequent sub-segments. Furthermore, to measure a model's long-video understanding ability more effectively, we introduce GLVC (Grounding Long Video Clues), a long video question-answering dataset that features grounding critical, concrete clues scattered throughout entire videos. Experimental results demonstrate that our method enables MLLMs with a limited context length of 32K to efficiently process 100K tokens (3,600 frames, an hour-long video sampled at 1 fps), requiring only 2 minutes and 37 GB of GPU memory. Evaluation across multiple long-video benchmarks illustrates that our method seeks critical clues from massive information more effectively.
Problem

Research questions and friction points this paper is trying to address.

Efficiently answer questions from long videos with limited memory
Seek critical clues from massive video information effectively
Reduce computational cost while maintaining video understanding accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Recurrent question-aware memory mechanism for critical clue seeking
Iterative sub-segment processing with a purposeful compression strategy
Aggregating memory tokens to update history context recurrently
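
The recurrent loop described by these contributions can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the token representation, the relevance scoring inside `compress`, and the function names are all assumptions standing in for the model's learned question-aware compression.

```python
def compress(tokens, question, memory_slots=8):
    """Question-aware compression (toy stand-in): keep the memory_slots
    tokens that share the most words with the question. In the paper this
    role is played by learned special memory tokens, not word overlap."""
    scored = sorted(tokens,
                    key=lambda tok: -sum(w in tok for w in question.split()))
    return scored[:memory_slots]

def seek_clues(subclips, question, memory_slots=8):
    """Process sub-segments one by one, carrying a compressed memory
    forward so the full video never sits in context at once."""
    history = []  # aggregated memory tokens from past sub-segments
    for subclip in subclips:
        context = history + subclip            # reuse history context
        history = compress(context, question, memory_slots)  # recurrent update
    return history  # critical clues handed to the MLLM for answering

# Toy usage: strings stand in for visual tokens of each sub-segment.
clips = [["a dog runs", "sky"], ["the dog digs a hole", "tree"], ["rain"]]
clues = seek_clues(clips, "what does the dog do", memory_slots=2)
```

Because the memory is capped at `memory_slots` tokens per step, context growth is bounded regardless of video length, which mirrors how the method fits an hour-long video into a 32K-context model.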
Henghui Du
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing
Chang Zhou
AI Technology Center, Online Video Business Unit, Tencent PCG
Chunjie Zhang
Beijing Jiaotong University
multimedia, computer vision
Xi Chen
AI Technology Center, Online Video Business Unit, Tencent PCG
Di Hu
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing