Video Detective: Seek Critical Clues Recurrently to Answer Question from Long Videos

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-video question answering (LVQA) suffers from memory explosion and information redundancy due to excessive contextual input. To address this, we propose a query-aware recurrent memory mechanism that iteratively processes video sub-segments, aggregates historical memory tokens, performs query-guided visual token compression, and recursively updates contextual representations, enabling precise localization and efficient reuse of critical evidence. We introduce dedicated memory tokens and establish GLVC (Grounding Long Video Clues), a long-video benchmark for grounding clues scattered throughout entire videos. Our method processes up to 100K visual tokens (3,600 frames) under a 32K-context limit, requiring only 2 minutes and 37 GB of GPU memory. It significantly improves retrieval accuracy of key visual clues across multiple LVQA benchmarks, achieving a favorable trade-off among efficiency, capacity, and accuracy.

📝 Abstract
Long Video Question-Answering (LVQA) presents a significant challenge for Multi-modal Large Language Models (MLLMs) due to the immense context and overloaded information, which can also lead to prohibitive memory consumption. While existing methods attempt to address these issues by reducing visual tokens or extending the model's context length, they may miss useful information or incur considerable computation. In fact, only a small amount of crucial information is required to answer a given question. We therefore propose an efficient question-aware memory mechanism that enables MLLMs to recurrently seek these critical clues. Our approach, named VideoDetective, simplifies the task by iteratively processing video sub-segments. For each sub-segment, a question-aware compression strategy introduces a few special memory tokens to achieve purposeful compression, allowing the model to seek critical clues effectively while reducing visual tokens. Since the history context can have a significant impact, we recurrently aggregate and store these memory tokens to update the history context, which is reused for subsequent sub-segments. Furthermore, to measure a model's long-video understanding ability more effectively, we introduce GLVC (Grounding Long Video Clues), a long video question-answering dataset that features grounding critical, concrete clues scattered throughout entire videos. Experimental results demonstrate that our method enables MLLMs with a limited context length of 32K to efficiently process 100K tokens (3,600 frames, an hour-long video sampled at 1 fps), requiring only 2 minutes and 37 GB of GPU memory. Evaluation across multiple long-video benchmarks illustrates that our method seeks critical clues from massive information more effectively.
Problem

Research questions and friction points this paper is trying to address.

Efficiently answer questions from long videos with limited memory
Seek critical clues from massive video information effectively
Reduce computational cost while maintaining video understanding accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Recurrent question-aware memory mechanism for critical clue seeking
Iterative sub-segment processing with a purposeful compression strategy
Aggregating memory tokens to update history context recurrently
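
The recurrent loop described by these contributions can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the token representation, the relevance scoring inside `compress`, and the function names are all assumptions standing in for the model's learned question-aware compression.

```python
def compress(tokens, question, memory_slots=8):
    """Question-aware compression (toy stand-in): keep the memory_slots
    tokens that share the most words with the question. In the paper this
    role is played by learned special memory tokens, not word overlap."""
    scored = sorted(tokens,
                    key=lambda tok: -sum(w in tok for w in question.split()))
    return scored[:memory_slots]

def seek_clues(subclips, question, memory_slots=8):
    """Process sub-segments one by one, carrying a compressed memory
    forward so the full video never sits in context at once."""
    history = []  # aggregated memory tokens from past sub-segments
    for subclip in subclips:
        context = history + subclip            # reuse history context
        history = compress(context, question, memory_slots)  # recurrent update
    return history  # critical clues handed to the MLLM for answering

# Toy usage: strings stand in for visual tokens of each sub-segment.
clips = [["a dog runs", "sky"], ["the dog digs a hole", "tree"], ["rain"]]
clues = seek_clues(clips, "what does the dog do", memory_slots=2)
```

Because the memory is capped at `memory_slots` tokens per step, context growth is bounded regardless of video length, which mirrors how the method fits an hour-long video into a 32K-context model.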
Henghui Du
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing
Chang Zhou
AI Technology Center, Online Video Business Unit, Tencent PCG
Chunjie Zhang
Beijing Jiaotong University
multimedia, computer vision
Xi Chen
AI Technology Center, Online Video Business Unit, Tencent PCG
Di Hu
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing