🤖 AI Summary
To address high visual-token redundancy, weak temporal modeling, and excessive inference latency in online understanding and real-time question answering over hours-long streaming videos, this paper proposes a training-free, efficient visual token selection framework. The method comprises three core components: (1) LLM attention-guided dynamic token pruning that retains semantically critical tokens; (2) a history-token recycling mechanism that explicitly models cross-segment temporal coherence; and (3) caption-based question answering for lightweight, accurate responses. The framework is fully compatible with standard Video-LLM architectures and requires no additional parameters or fine-tuning. On streaming video benchmarks it achieves state-of-the-art performance, pruning ~95% of redundant visual tokens with <0.5% accuracy degradation and accelerating inference by 3.2×.
📝 Abstract
Video Large Language Models (Video-LLMs) excel at in-context video understanding when they have full access to the video at question time. However, these models face challenges in streaming scenarios, where hour-long videos must be processed online and questions require timely responses. In this work, we propose a training-free approach compatible with standard Video-LLMs, built on three key ideas: 1) LLM-informed selection of visual tokens, identifying the tokens the LLM has attended to and that contributed to its understanding of each short clip; this attention-based selection allows us to discard up to ~95% of unimportant visual tokens with minimal performance loss; 2) recurrent processing of previously selected tokens to produce a temporally coherent understanding of each processed clip; 3) caption-based question answering for lightweight yet accurate responses. Our method achieves state-of-the-art performance on streaming video benchmarks, striking a balance between efficiency and effectiveness.
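The first two ideas, attention-guided token selection and recurrent recycling of the selected tokens, can be sketched as follows. This is a minimal illustration, not the authors' implementation: `attn_fn` is a hypothetical stand-in for extracting per-token attention mass from a Video-LLM, tokens are plain NumPy arrays, and the 5% keep ratio mirrors the ~95% pruning figure from the abstract.

```python
import numpy as np

def select_visual_tokens(tokens, attn_scores, keep_ratio=0.05):
    # tokens: (N, D) visual-token embeddings for one clip.
    # attn_scores: (N,) attention mass each token received from the LLM.
    # Keep only the top keep_ratio fraction (~5%), preserving temporal order.
    k = max(1, int(round(len(tokens) * keep_ratio)))
    keep = np.sort(np.argsort(attn_scores)[-k:])  # top-k indices, sorted in time
    return tokens[keep]

def process_stream(clips, attn_fn, keep_ratio=0.05):
    # clips: list of (N_i, D) token arrays arriving online, one per short clip.
    # attn_fn(history, clip) -> (N_i,) attention scores for the clip's tokens,
    # conditioned on the recycled history (a placeholder for the Video-LLM).
    history = None
    for clip in clips:
        scores = attn_fn(history, clip)
        selected = select_visual_tokens(clip, scores, keep_ratio)
        # Recycle: selected tokens join the running history, so later clips
        # are interpreted with temporal context from earlier ones.
        history = selected if history is None else np.vstack([history, selected])
    return history
```

In this sketch the memory grows only by the small selected set per clip, which is what keeps hour-long streams tractable; the real method's caption-based answering would then query this compact history rather than the full token stream.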