Thinking in Streaming Video

📅 2026-03-13
🤖 AI Summary
This work addresses the limitations of existing video understanding methods, which predominantly rely on batch processing and struggle to meet the low-latency and computational-efficiency demands of real-time streaming scenarios. To overcome this, the authors propose ThinkStream, a novel framework grounded in the Watch–Think–Speak paradigm that enables incremental comprehension as new frames arrive and dynamically determines when to respond. Key innovations include a Reasoning-Compressed Streaming Memory (RCSM) mechanism, a streaming reinforcement learning training strategy, and a verifiable reward scheme, collectively facilitating effective integration of incremental reasoning with semantic memory compression. Experimental results demonstrate that ThinkStream significantly outperforms current online models across multiple streaming video benchmarks while maintaining low latency and a minimal memory footprint.

📝 Abstract
Real-time understanding of continuous video streams is essential for interactive assistants and multimodal agents operating in dynamic environments. However, most existing video reasoning approaches follow a batch paradigm that defers reasoning until the full video context is observed, resulting in high latency and growing computational cost that are incompatible with streaming scenarios. In this paper, we introduce ThinkStream, a framework for streaming video reasoning based on a Watch–Think–Speak paradigm that enables models to incrementally update their understanding as new video observations arrive. At each step, the model performs a short reasoning update and decides whether sufficient evidence has accumulated to produce a response. To support long-horizon streaming, we propose Reasoning-Compressed Streaming Memory (RCSM), which treats intermediate reasoning traces as compact semantic memory that replaces outdated visual tokens while preserving essential context. We further train the model using a Streaming Reinforcement Learning with Verifiable Rewards scheme that aligns incremental reasoning and response timing with the requirements of streaming interaction. Experiments on multiple streaming video benchmarks show that ThinkStream significantly outperforms existing online video models while maintaining low latency and memory usage. Code, models, and data will be released at https://github.com/johncaged/ThinkStream.
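The Watch–Think–Speak loop and the reasoning-compressed memory described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all class, function, and parameter names (`ReasoningCompressedMemory`, `watch_think_speak`, `max_visual_chunks`, the `think`/`respond` callables) are assumptions introduced here, and the real model operates on learned token representations rather than Python strings.

```python
from collections import deque

class ReasoningCompressedMemory:
    """Bounded memory sketch: recent visual tokens are kept as-is, while
    tokens about to be evicted are replaced by a compressed reasoning
    summary (standing in for RCSM's semantic memory)."""

    def __init__(self, max_visual_chunks=4):
        self.visual = deque(maxlen=max_visual_chunks)  # recent frame tokens
        self.semantic = []                             # compressed reasoning traces

    def add_frame(self, frame_tokens, reasoning_trace):
        # If the visual buffer is full, the oldest chunk will be evicted on
        # append; keep a compact summary of the current reasoning instead.
        if len(self.visual) == self.visual.maxlen:
            self.semantic.append(summarize(reasoning_trace))
        self.visual.append(frame_tokens)

def summarize(trace):
    # Stand-in for semantic compression of a reasoning trace.
    return trace[:32]

def watch_think_speak(frames, enough_evidence, think, respond):
    """Per incoming frame: fold the observation into memory (watch),
    run a short incremental reasoning update (think), and answer only
    once sufficient evidence has accumulated (speak)."""
    memory = ReasoningCompressedMemory()
    for frame in frames:
        trace = think(frame, memory)    # short reasoning update
        memory.add_frame(frame, trace)  # watch: update streaming memory
        if enough_evidence(trace):      # decide whether to respond now
            return respond(memory, trace)
    return respond(memory, None)        # end of stream: answer with what we have
```

The key design point this sketch mirrors is that memory stays bounded regardless of stream length: visual detail is retained only for a recent window, while older context survives as compact reasoning summaries.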
Problem

Research questions and friction points this paper is trying to address.

streaming video
real-time understanding
online video reasoning
low latency
computational cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

streaming video reasoning
incremental reasoning
compressed memory
reinforcement learning
real-time multimodal agents
Zikang Liu
Institute of Automation, Chinese Academy of Sciences
Multimodal Learning, Computer Vision

Longteng Guo
Institute of Automation, Chinese Academy of Sciences

Handong Li
Institute of Automation, Chinese Academy of Sciences
cross-modality pretraining

Ru Zhen
OPPO AI Center, OPPO Inc.

Xingjian He
Institute of Automation, Chinese Academy of Sciences (CASIA)
computer vision, semantic segmentation

Ruyi Ji
University of Michigan
Program Synthesis

Xiaoming Ren
OPPO AI Center, OPPO Inc.

Yanhao Zhang
Alibaba Damo Academy, OPPO AI Center
MLLM, AIGC, Foundation Models

Haonan Lu
OPPO AI Center, OPPO Inc.

Jing Liu
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences