Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses online streaming video understanding, which requires real-time processing of unbounded video streams under strict memory constraints while supporting queries issued at arbitrary times. Existing methods struggle to combine semantic awareness with efficient memory management. The authors propose SAVEMem, a training-free framework that builds a three-tier semantic-aware streaming memory and adds a query-adaptive retrieval mechanism, so that memory compression and retrieval are coordinated rather than handled separately. Key innovations include semantic memory stratification driven by a fixed pseudo-question bank, anchor-conditioned recency gating, and late interaction between query and memory tokens. Applied to Qwen2.5-VL, SAVEMem raises the OVO-Bench overall score from 52.27 to 62.69, achieves consistent gains on StreamingBench and ODV-Bench, and reduces peak GPU memory by 48% at 128 frames relative to the backbone.

📝 Abstract

Online streaming video understanding requires models to process continuous visual inputs and respond to user queries in real time, where the unbounded stream and unpredictable query timing turn memory management into a central challenge. Existing methods typically compress visual tokens via visual similarity heuristics, or augment compression with KV-cache-level retrieval. However, compression decisions rarely incorporate semantic signals, and retrieval is often added after compression is finalized, making the two stages hard to coordinate. We present SAVEMem, a training-free dual-stage framework that brings semantic awareness into memory generation and lets the retrieval scope adapt per query. In Stage 1, SAVEMem builds a three-tier streaming memory online under a constant memory budget. A fixed pseudo-question bank provides a lightweight semantic prior, so that long-term retention is shaped by semantic salience rather than visual similarity alone. In Stage 2, SAVEMem performs query-aware retrieval over this memory. An anchor-conditioned recency gate adapts the retrieval scope from short-term to mid- and long-term memory based on whether the query targets the present or the distant past. Within this scope, late interaction between query and memory tokens selects candidate frames for answering. Applied to Qwen2.5-VL without training, SAVEMem improves the OVO-Bench overall score from 52.27 to 62.69 and yields consistent gains on StreamingBench and ODV-Bench, while reducing peak GPU memory by 48% at 128 frames over the backbone.
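
To make the Stage 2 retrieval step more concrete, here is a minimal sketch of scoped late-interaction frame selection. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the scoring function, so the sketch assumes a ColBERT-style MaxSim score (for each query token, the best cosine match among a frame's tokens, summed over query tokens). The tier names (`"short"`, `"mid"`, `"long"`), the `scope` argument standing in for the output of the recency gate, and the function names are all hypothetical.

```python
import math


def normalize(vec):
    """Return an L2-normalized copy of a vector (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_tokens, frame_tokens):
    """Assumed late-interaction score: for each query token, take its best
    cosine match among the frame's tokens, then sum over query tokens.
    Each frame is assumed to hold at least one token."""
    q = [normalize(t) for t in query_tokens]
    f = [normalize(t) for t in frame_tokens]
    return sum(
        max(sum(a * b for a, b in zip(qt, ft)) for ft in f)
        for qt in q
    )


def select_frames(query_tokens, memory, scope, top_k):
    """Rank frames inside the allowed tiers and return the ids of the top_k.

    memory: list of (frame_id, tier, frame_tokens); tier names are hypothetical.
    scope:  set of tier names opened for this query (stand-in for the gate).
    """
    candidates = [
        (maxsim_score(query_tokens, tokens), frame_id)
        for frame_id, tier, tokens in memory
        if tier in scope
    ]
    # Highest score first; Python's sort is stable, so ties keep memory order.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [frame_id for _, frame_id in candidates[:top_k]]


# Toy memory with one frame per tier and 2-dimensional token embeddings.
memory = [
    (0, "long", [[1.0, 0.0], [0.0, 1.0]]),
    (1, "mid", [[1.0, 1.0]]),
    (2, "short", [[0.0, 1.0]]),
]
query = [[1.0, 0.0]]

# A query about the present only searches short-term memory.
recent_only = select_frames(query, memory, scope={"short"}, top_k=1)
# A query about the distant past widens the scope to all tiers.
full_scope = select_frames(query, memory, scope={"short", "mid", "long"}, top_k=2)
```

In this toy setup, widening the scope lets the long-term frame that best matches the query outrank the recent frame, which is the behavior the abstract attributes to adapting the retrieval scope per query.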

Problem

Research questions and friction points this paper is trying to address.

streaming video understanding

memory management

semantic awareness

query-aware retrieval

visual memory

Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic-aware memory

streaming video understanding

query-adaptive retrieval