StreamForest: Efficient Online Video Understanding with Persistent Event Memory

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of limited long-term memory storage and real-time spatiotemporal reasoning in multimodal large language models (MLLMs) for streaming video understanding, this paper proposes StreamForest. Methodologically, it introduces: (1) a Persistent Event Memory Forest that adaptively organizes frames into event-level tree structures for efficient long-term memory compression, guided by penalty functions over temporal distance, content similarity, and merge frequency; (2) a Fine-grained Spatiotemporal Window that preserves short-term visual detail for stable current-scene perception even under high compression; (3) OnlineIT, an instruction-tuning dataset tailored to streaming video tasks; and (4) ODV-Bench, a benchmark for real-time streaming understanding in autonomous driving scenarios. StreamForest achieves 77.3% on StreamingBench, 60.5% on OVBench, and 55.6% on OVO-Bench, and retains 96.8% of its average accuracy across eight benchmarks when visual tokens are compressed to 1K, demonstrating that high compression efficiency and inference stability can be achieved simultaneously in streaming video understanding.

📝 Abstract
Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in video understanding. However, their effectiveness in real-time streaming scenarios remains limited due to storage constraints of historical visual features and insufficient real-time spatiotemporal reasoning. To address these challenges, we propose StreamForest, a novel architecture specifically designed for streaming video understanding. Central to StreamForest is the Persistent Event Memory Forest, a memory mechanism that adaptively organizes video frames into multiple event-level tree structures. This process is guided by penalty functions based on temporal distance, content similarity, and merge frequency, enabling efficient long-term memory retention under limited computational resources. To enhance real-time perception, we introduce a Fine-grained Spatiotemporal Window, which captures detailed short-term visual cues to improve current scene perception. Additionally, we present OnlineIT, an instruction-tuning dataset tailored for streaming video tasks. OnlineIT significantly boosts MLLM performance in both real-time perception and future prediction. To evaluate generalization in practical applications, we introduce ODV-Bench, a new benchmark focused on real-time streaming video understanding in autonomous driving scenarios. Experimental results demonstrate that StreamForest achieves state-of-the-art performance, with accuracies of 77.3% on StreamingBench, 60.5% on OVBench, and 55.6% on OVO-Bench. In particular, even under extreme visual token compression (limited to 1024 tokens), the model retains 96.8% of its average accuracy across eight benchmarks relative to the default setting. These results underscore the robustness, efficiency, and generalizability of StreamForest for streaming video understanding.
Problem

Research questions and friction points this paper is trying to address.

Addresses the limits of real-time video understanding in streaming scenarios
Overcomes storage constraints on historical visual features in long videos
Enhances spatiotemporal reasoning for continuous video analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Persistent Event Memory Forest organizes frames into event-level tree structures
Fine-grained Spatiotemporal Window captures short-term visual cues
OnlineIT dataset enhances real-time perception and prediction
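The bullets above describe merging frames into event-level trees under a penalty combining temporal distance, content similarity, and merge frequency. A minimal sketch of how such a merge score might work (the weights, node fields, greedy adjacent-pair merge, and mean-pooled features are illustrative assumptions, not the paper's actual formulation):

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def merge_penalty(na, nb, alpha=1.0, beta=1.0, gamma=0.5):
    """Score a candidate merge of two event nodes; lower means a better merge.

    Combines temporal distance, content dissimilarity, and how often each
    node has already been merged (hypothetical weights alpha/beta/gamma).
    """
    dt = abs(na["time"] - nb["time"])            # temporal distance
    dissim = 1.0 - cosine(na["feat"], nb["feat"])  # content dissimilarity
    freq = na["merges"] + nb["merges"]           # discourage over-merged nodes
    return alpha * dt + beta * dissim + gamma * freq

def compress_step(nodes):
    """Greedily merge the adjacent pair with the lowest penalty."""
    best = min(range(len(nodes) - 1),
               key=lambda i: merge_penalty(nodes[i], nodes[i + 1]))
    a, b = nodes[best], nodes[best + 1]
    merged = {
        "time": (a["time"] + b["time"]) / 2,
        "feat": [(x + y) / 2 for x, y in zip(a["feat"], b["feat"])],
        "merges": a["merges"] + b["merges"] + 1,
    }
    return nodes[:best] + [merged] + nodes[best + 2:]
```

Repeating `compress_step` until the node count fits a token budget would yield the kind of bounded long-term memory the summary describes; the merge-frequency term keeps the resulting forest from collapsing into a single over-merged node.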
👥 Authors
Xiangyu Zeng (Nanjing University)
Kefan Qiu (Nanjing University)
Qingyu Zhang (Institute of Software, Chinese Academy of Sciences)
Xinhao Li (Nanjing University)
Jing Wang (Nanjing University)
Jiaxin Li (Nanjing University)
Ziang Yan (Zhejiang University, Shanghai AI Laboratory)
Kun Tian (Intel)
Meng Tian (Yinwang Intelligent Tech.)
Xinhai Zhao (Noah’s Ark Lab, Huawei)
Yi Wang (Shanghai AI Laboratory)
Limin Wang (Nanjing University)

🏷️ Topics: Video Understanding · Multimodal LLM · Vision-Language Learning