CurveStream: Boosting Streaming Video Understanding in MLLMs via Curvature-Aware Hierarchical Visual Memory Management

📅 2026-03-19
🤖 AI Summary
This work addresses a core obstacle in streaming video understanding: the visual tokens consumed by multimodal large language models grow linearly with stream length, causing memory overflow or catastrophic forgetting, while existing memory management approaches lack semantic awareness. To overcome these limitations, the authors propose CurveStream, a training-free, curvature-aware hierarchical memory management framework that links the geometric curvature of feature trajectories to semantic changes. CurveStream uses a curvature score to identify keyframes and an online K-Sigma dynamic threshold to route each frame into a clear or fuzzy memory state under a strict token budget. The method reports absolute gains of 10.69% on StreamingBench and 13.58% on OVOBench over the respective baselines, which the authors present as a new state of the art in streaming video understanding.
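
The summary does not give the exact formula for the curvature score, so the following is only a minimal sketch of one plausible reading: treat per-frame feature vectors as points on a trajectory and measure the turning angle between consecutive displacement vectors. The function name `curvature_scores`, the turning-angle definition, and the `eps` parameter are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def curvature_scores(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Discrete turning angle (radians) at each point of a feature trajectory.

    features: array of shape (T, D), one feature vector per frame.
    Returns an array of shape (T,); endpoints and stationary points score 0.
    NOTE: the turning angle is an assumed stand-in for the paper's score.
    """
    features = np.asarray(features, dtype=np.float64)
    scores = np.zeros(len(features))
    if len(features) < 3:
        return scores  # curvature needs at least three points

    steps = np.diff(features, axis=0)        # displacement between frames, (T-1, D)
    norms = np.linalg.norm(steps, axis=1)    # length of each displacement
    dots = np.sum(steps[:-1] * steps[1:], axis=1)
    denom = norms[:-1] * norms[1:]

    # Where the trajectory did not move, treat the direction as unchanged.
    cos = np.ones_like(denom)
    valid = denom > eps
    cos[valid] = dots[valid] / denom[valid]

    scores[1:-1] = np.arccos(np.clip(cos, -1.0, 1.0))
    return scores
```

Under this definition, a trajectory that moves in a straight line scores zero everywhere, while a sharp change of direction in feature space produces a peak at the frame where the turn occurs.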

📝 Abstract
Multimodal Large Language Models have achieved significant success in offline video understanding, yet their application to streaming videos is severely limited by the linear explosion of visual tokens, which often leads to Out-of-Memory (OOM) errors or catastrophic forgetting. Existing visual retention and memory management methods typically rely on uniform sampling, low-level physical metrics, or passive cache eviction. However, these strategies often lack intrinsic semantic awareness, potentially disrupting contextual coherence and blurring transient yet critical semantic transitions. To address these limitations, we propose CurveStream, a training-free, curvature-aware hierarchical visual memory management framework. Our approach is motivated by the key observation that high-curvature regions along continuous feature trajectories closely align with critical global semantic transitions. Based on this geometric insight, CurveStream evaluates real-time semantic intensity via a Curvature Score and integrates an online K-Sigma dynamic threshold to adaptively route frames into clear and fuzzy memory states under a strict token budget. Evaluations across diverse temporal scales confirm that this lightweight framework consistently yields absolute performance gains of over 10% (e.g., 10.69% on StreamingBench and 13.58% on OVOBench) over the respective baselines, establishing new state-of-the-art results for streaming video perception. The code will be released at https://github.com/streamingvideos/CurveStream.
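
The abstract names an online K-Sigma threshold and a token budget but does not specify either, so the sketch below is only an illustration of the general idea. A frame is routed to "clear" memory when its score exceeds the running mean plus k standard deviations, and to "fuzzy" memory otherwise. Everything here is an assumption: the class name `KSigmaRouter`, the warm-up period, the per-state token costs, and the eviction policy (drop the oldest fuzzy entry first, then the oldest clear entry).

```python
import math
from collections import deque
from dataclasses import dataclass, field


@dataclass
class KSigmaRouter:
    """Hypothetical router: score > mean + k * std goes to 'clear' memory."""
    k: float = 2.0          # assumed sensitivity (the "K" in K-Sigma)
    warmup: int = 5         # assumed number of frames seen before thresholding
    clear_cost: int = 64    # assumed tokens kept for a clear frame
    fuzzy_cost: int = 16    # assumed tokens kept for a fuzzy frame
    budget: int = 128       # assumed total token budget
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0         # running sum of squared deviations (Welford)
    memory: deque = field(default_factory=deque)  # (frame_id, state, cost)

    def threshold(self) -> float:
        if self.n < max(self.warmup, 2):
            return math.inf  # not enough history: everything is fuzzy
        std = math.sqrt(self.m2 / (self.n - 1))
        return self.mean + self.k * std

    def used(self) -> int:
        return sum(cost for _, _, cost in self.memory)

    def push(self, frame_id: int, score: float) -> str:
        # Decide with statistics of past frames only, then update them.
        state = "clear" if score > self.threshold() else "fuzzy"
        self.n += 1
        delta = score - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (score - self.mean)

        cost = self.clear_cost if state == "clear" else self.fuzzy_cost
        self.memory.append((frame_id, state, cost))
        self._evict()
        return state

    def _evict(self) -> None:
        # Assumed policy: drop the oldest fuzzy entry, else the oldest entry.
        while self.memory and self.used() > self.budget:
            idx = next(
                (i for i, e in enumerate(self.memory) if e[1] == "fuzzy"), 0
            )
            del self.memory[idx]
```

Because the mean and variance are updated incrementally, the threshold adapts to each stream without storing past scores. The decision for a frame uses only the statistics of earlier frames, so a single spike does not raise its own threshold.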
Problem

Research questions and friction points this paper is trying to address.

streaming video understanding
visual memory management
semantic transitions
multimodal large language models
Out-of-Memory
Innovation

Methods, ideas, or system contributions that make the work stand out.

curvature-aware
hierarchical memory management
streaming video understanding
semantic transition detection
token-efficient MLLMs