Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams

📅 2026-01-22

📈 Citations: 0

✨ Influential: 0

career value

193K/year

🤖 AI Summary

This work addresses the challenges of real-time understanding of long video streams, which are hindered by redundant frame processing and contextual forgetting. Existing approaches often sacrifice critical temporal information due to fixed decoding strategies or aggressive cache pruning. To overcome this, we propose an event-driven framework that models video as a sequence of discrete semantic events, integrating motion, semantic, and predictive cues to detect state transitions. Language generation is triggered precisely at event boundaries, while a persistent memory bank enables long-horizon reasoning. Our approach introduces, for the first time, an event-aware mechanism that enables on-demand generation, achieving low latency without compromising long-term context retention. Built upon LLaMA-3-8B, our end-to-end system outperforms VideoLLM-Online-8B by 10.4 points on OVOBench-Realtime and maintains a ~70% win rate against GPT-5 on two-hour Ego4D streams, matching the performance of the specialized model Flash-VStream-7B.

Technology Category

Application Category

📝 Abstract

Real-time understanding of long video streams remains challenging for multimodal large language models (VLMs) due to redundant frame processing and rapid forgetting of past context. Existing streaming systems rely on fixed-interval decoding or cache pruning, which either produce repetitive outputs or discard crucial temporal information. We introduce Event-VStream, an event-aware framework that represents continuous video as a sequence of discrete, semantically coherent events. Our system detects meaningful state transitions by integrating motion, semantic, and predictive cues, and triggers language generation only at those boundaries. Each event embedding is consolidated into a persistent memory bank, enabling long-horizon reasoning while maintaining low latency. Across OVOBench-Realtime, and long-form Ego4D evaluations, Event-VStream achieves competitive performance. It improves over a VideoLLM-Online-8B baseline by +10.4 points on OVOBench-Realtime, achieves performance close to Flash-VStream-7B despite using only a general-purpose LLaMA-3-8B text backbone, and maintains around 70% GPT-5 win rate on 2-hour Ego4D streams.

Problem

Research questions and friction points this paper is trying to address.

real-time understanding

long video streams

multimodal large language models

temporal information

redundant frame processing

Innovation

Methods, ideas, or system contributions that make the work stand out.

event-driven

video streaming

multimodal LLM