EvoStreaming: Your Offline Video Model Is a Natively Streaming Assistant

📅 2026-05-11

🤖 AI Summary

Most video-language models (VideoLLMs) are designed for offline inference and lack an interaction policy for deciding when to respond to a live video stream; existing streaming benchmarks, moreover, hand that timing decision to the evaluator rather than the model. The authors first introduce RealStreamEval, a frame-level multi-turn evaluation protocol that feeds models sequential observations and penalizes unnecessary responses. They then propose EvoStreaming, a self-evolved streaming adaptation framework in which the base model itself serves as data generator, relevance annotator, and roll-out policy, synthesizing streaming trajectories without external supervision or architectural changes. With only 1,000 self-generated samples (139 times fewer than the leading streaming instruction-tuning approach), EvoStreaming improves the overall RealStreamEval score by up to 10.8 points across five open VideoLLM backbones while largely preserving offline video performance.

📝 Abstract

Streaming video understanding demands more than watching longer videos: assistants must decide when to speak in real time, balancing responsiveness against verbosity. Yet most video-language models (VideoLLMs) are trained for offline inference, and existing streaming benchmarks externalize this timing decision to the evaluator. We address this gap with RealStreamEval, a frame-level multi-turn evaluation protocol that exposes models to sequential observations and penalizes unnecessary responses. Under this protocol, we observe that strong offline VideoLLMs retain useful visual understanding but lack an interaction policy for deciding when to respond. Motivated by this observation, we propose EvoStreaming, a self-evolved streaming adaptation framework in which the base model itself acts as data generator, relevance annotator, and roll-out policy to synthesize streaming trajectories without external supervision. With only $1{,}000$ self-generated samples ($139\times$ fewer than the leading streaming instruction-tuning approach) and no architectural changes, EvoStreaming consistently improves the overall RealStreamEval score by up to $10.8$ points across five open VideoLLM backbones (Qwen2/2.5/3-VL, InternVL-3.5, MiniCPM-V4.5) while largely preserving offline video performance. These results suggest that data-efficient interaction tuning is a practical path for adapting existing VideoLLMs to streaming assistants.
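
To make the idea of a frame-level protocol that penalizes unnecessary responses concrete, here is a minimal Python sketch. It is not the RealStreamEval implementation: the abstract does not specify the scoring formula, so `StreamScore`, `run_stream`, the `penalty` weight, and the two toy policies are all hypothetical names and choices made for illustration. The sketch also ignores whether a reply is correct and scores only its timing.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Set


@dataclass
class StreamScore:
    hits: int          # replies emitted at frames marked relevant
    misses: int        # relevant frames where the policy stayed silent
    false_alarms: int  # replies emitted at frames that were not relevant

    def score(self, penalty: float = 0.5) -> float:
        """Hypothetical aggregate: fraction of relevant frames answered,
        reduced by `penalty` for each unnecessary reply, floored at 0."""
        total = self.hits + self.misses
        if total == 0:
            return 0.0
        return max(0.0, (self.hits - penalty * self.false_alarms) / total)


# A policy sees every frame observed so far and returns a reply or None (silence).
Policy = Callable[[List[int]], Optional[str]]


def run_stream(frames: Sequence[int], relevant: Set[int], policy: Policy) -> StreamScore:
    """Feed frames one at a time; the policy itself decides when to speak."""
    history: List[int] = []
    hits = misses = false_alarms = 0
    for t, frame in enumerate(frames):
        history.append(frame)
        reply = policy(history)
        if reply is not None and t in relevant:
            hits += 1
        elif reply is not None:
            false_alarms += 1
        elif t in relevant:
            misses += 1
    return StreamScore(hits, misses, false_alarms)


def always_talk(history: List[int]) -> Optional[str]:
    """Toy stand-in for an offline model that answers after every frame."""
    return "answer"


def well_timed(history: List[int]) -> Optional[str]:
    """Toy policy that speaks only when the newest frame is 2 or 4."""
    return "answer" if history[-1] in {2, 4} else None


frames = list(range(6))
relevant = {2, 4}
verbose = run_stream(frames, relevant, always_talk)
timed = run_stream(frames, relevant, well_timed)
```

Under these assumed weights, the always-talking policy answers both relevant frames but loses all of its credit to four unnecessary replies, while the well-timed policy keeps full credit. This mirrors the abstract's point that visual understanding alone is not enough without a policy for when to respond.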

Problem

Research questions and friction points this paper is trying to address.

streaming video understanding

video-language models

real-time response timing

interaction policy

streaming evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

EvoStreaming

streaming video understanding

self-evolved adaptation