Thinking in Streaming Video

📅 2026-03-13
🤖 AI Summary
This work addresses the limitations of existing video understanding methods, which predominantly rely on batch processing and struggle to meet the low-latency and computational-efficiency demands of real-time streaming scenarios. To overcome this, the authors propose ThinkStream, a novel framework grounded in the Watch–Think–Speak paradigm that enables incremental comprehension as new frames arrive and dynamically determines when to respond. Key innovations include a Reasoning-Compressed Streaming Memory (RCSM) mechanism, a streaming reinforcement learning training strategy, and a verifiable reward scheme, collectively facilitating effective integration of incremental reasoning with semantic memory compression. Experimental results demonstrate that ThinkStream significantly outperforms current online models across multiple streaming video benchmarks while maintaining low latency and a minimal memory footprint.

📝 Abstract
Real-time understanding of continuous video streams is essential for interactive assistants and multimodal agents operating in dynamic environments. However, most existing video reasoning approaches follow a batch paradigm that defers reasoning until the full video context is observed, resulting in high latency and growing computational cost that are incompatible with streaming scenarios. In this paper, we introduce ThinkStream, a framework for streaming video reasoning based on a Watch–Think–Speak paradigm that enables models to incrementally update their understanding as new video observations arrive. At each step, the model performs a short reasoning update and decides whether sufficient evidence has accumulated to produce a response. To support long-horizon streaming, we propose Reasoning-Compressed Streaming Memory (RCSM), which treats intermediate reasoning traces as compact semantic memory that replaces outdated visual tokens while preserving essential context. We further train the model using a Streaming Reinforcement Learning with Verifiable Rewards scheme that aligns incremental reasoning and response timing with the requirements of streaming interaction. Experiments on multiple streaming video benchmarks show that ThinkStream significantly outperforms existing online video models while maintaining low latency and memory usage. Code, models, and data will be released at https://github.com/johncaged/ThinkStream.
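The Watch–Think–Speak loop and the reasoning-compressed memory described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all class, function, and parameter names (`ReasoningCompressedMemory`, `watch_think_speak`, `max_visual_chunks`, the `think`/`respond` callables) are assumptions introduced here, and the real model operates on learned token representations rather than Python strings.

```python
from collections import deque

class ReasoningCompressedMemory:
    """Bounded memory sketch: recent visual tokens are kept as-is, while
    tokens about to be evicted are replaced by a compressed reasoning
    summary (standing in for RCSM's semantic memory)."""

    def __init__(self, max_visual_chunks=4):
        self.visual = deque(maxlen=max_visual_chunks)  # recent frame tokens
        self.semantic = []                             # compressed reasoning traces

    def add_frame(self, frame_tokens, reasoning_trace):
        # If the visual buffer is full, the oldest chunk will be evicted on
        # append; keep a compact summary of the current reasoning instead.
        if len(self.visual) == self.visual.maxlen:
            self.semantic.append(summarize(reasoning_trace))
        self.visual.append(frame_tokens)

def summarize(trace):
    # Stand-in for semantic compression of a reasoning trace.
    return trace[:32]

def watch_think_speak(frames, enough_evidence, think, respond):
    """Per incoming frame: fold the observation into memory (watch),
    run a short incremental reasoning update (think), and answer only
    once sufficient evidence has accumulated (speak)."""
    memory = ReasoningCompressedMemory()
    for frame in frames:
        trace = think(frame, memory)    # short reasoning update
        memory.add_frame(frame, trace)  # watch: update streaming memory
        if enough_evidence(trace):      # decide whether to respond now
            return respond(memory, trace)
    return respond(memory, None)        # end of stream: answer with what we have
```

The key design point this sketch mirrors is that memory stays bounded regardless of stream length: visual detail is retained only for a recent window, while older context survives as compact reasoning summaries.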
Problem

Research questions and friction points this paper is trying to address.

streaming video
real-time understanding
online video reasoning
low latency
computational cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

streaming video reasoning
incremental reasoning
compressed memory
reinforcement learning
real-time multimodal agents
Zikang Liu
Institute of Automation, Chinese Academy of Sciences
Multimodal Learning, Computer Vision

Longteng Guo
Institute of Automation, Chinese Academy of Sciences

Handong Li
Institute of Automation, Chinese Academy of Sciences
cross-modality pretraining

Ru Zhen
OPPO AI Center, OPPO Inc.

Xingjian He
Institute of Automation, Chinese Academy of Sciences (CASIA)
computer vision, semantic segmentation

Ruyi Ji
University of Michigan
Program Synthesis

Xiaoming Ren
OPPO AI Center, OPPO Inc.

Yanhao Zhang
Alibaba Damo Academy, OPPO AI Center
MLLM, AIGC, Foundation Models

Haonan Lu
OPPO AI Center, OPPO Inc.

Jing Liu
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences