🤖 AI Summary
To address the challenges of limited long-term memory storage and insufficient real-time spatiotemporal reasoning in multimodal large language models (MLLMs) for streaming video understanding, this paper proposes StreamForest. Methodologically, it introduces: (1) a Persistent Event Memory Forest that adaptively organizes video frames into event-level tree structures for efficient long-term memory compression, guided by penalty functions based on temporal distance, content similarity, and merge frequency; (2) a Fine-grained Spatiotemporal Window that captures detailed short-term visual cues to preserve real-time scene perception under high compression; (3) OnlineIT, an instruction-tuning dataset tailored for streaming video tasks; and (4) ODV-Bench, a benchmark for real-time streaming video understanding in autonomous driving scenarios. StreamForest achieves state-of-the-art accuracies of 77.3% on StreamingBench, 60.5% on OVBench, and 55.6% on OVO-Bench, and even under extreme compression to 1024 visual tokens it retains 96.8% of its average accuracy across eight benchmarks, demonstrating both compression efficiency and inference stability in streaming video understanding.
📝 Abstract
Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in video understanding. However, their effectiveness in real-time streaming scenarios remains limited due to storage constraints on historical visual features and insufficient real-time spatiotemporal reasoning. To address these challenges, we propose StreamForest, a novel architecture specifically designed for streaming video understanding. Central to StreamForest is the Persistent Event Memory Forest, a memory mechanism that adaptively organizes video frames into multiple event-level tree structures. This process is guided by penalty functions based on temporal distance, content similarity, and merge frequency, enabling efficient long-term memory retention under limited computational resources. To enhance real-time perception, we introduce a Fine-grained Spatiotemporal Window, which captures detailed short-term visual cues to improve current scene perception. Additionally, we present OnlineIT, an instruction-tuning dataset tailored for streaming video tasks. OnlineIT significantly boosts MLLM performance in both real-time perception and future prediction. To evaluate generalization in practical applications, we introduce ODV-Bench, a new benchmark focused on real-time streaming video understanding in autonomous driving scenarios. Experimental results demonstrate that StreamForest achieves state-of-the-art performance, with accuracies of 77.3% on StreamingBench, 60.5% on OVBench, and 55.6% on OVO-Bench. In particular, even under extreme visual token compression (limited to 1024 tokens), the model retains 96.8% of its average accuracy across eight benchmarks relative to the default setting. These results underscore the robustness, efficiency, and generalizability of StreamForest for streaming video understanding.
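To make the memory mechanism concrete, the abstract's penalty-guided merging can be sketched as follows. This is a minimal illustration, not the paper's implementation: the node fields (`time`, `feat`, `merges`), the linear weighting, and the greedy adjacent-pair selection are all assumptions; the paper only states that merging is guided by penalties on temporal distance, content similarity, and merge frequency.

```python
import math


def merge_penalty(node_a, node_b, w_t=1.0, w_s=1.0, w_f=1.0):
    """Hypothetical merge penalty combining the three cues the paper names.

    Lower penalty = better merge candidate. The exact functional form and
    weights in StreamForest are not given here; this is an illustrative
    linear combination.
    """
    # Temporal distance: frames far apart in time are worse merge candidates.
    temporal = abs(node_a["time"] - node_b["time"])

    # Content similarity: cosine dissimilarity of pooled frame features;
    # dissimilar content incurs a higher penalty.
    dot = sum(x * y for x, y in zip(node_a["feat"], node_b["feat"]))
    norm_a = math.sqrt(sum(x * x for x in node_a["feat"]))
    norm_b = math.sqrt(sum(x * x for x in node_b["feat"]))
    dissimilarity = 1.0 - dot / (norm_a * norm_b)

    # Merge frequency: nodes that already absorbed many merges are penalized
    # so a single event is not compressed indefinitely.
    frequency = node_a["merges"] + node_b["merges"]

    return w_t * temporal + w_s * dissimilarity + w_f * frequency


def pick_merge_pair(nodes):
    """Greedily choose the adjacent pair of memory nodes with lowest penalty."""
    pairs = list(zip(nodes, nodes[1:]))
    return min(pairs, key=lambda p: merge_penalty(*p))
```

For example, given three nodes where the first two are close in time and visually similar, `pick_merge_pair` selects them over a temporally distant, dissimilar pair, which is the qualitative behavior the penalty design aims for.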