Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously

📅 2026-03-12
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the limitations of existing video large language models in streaming scenarios, where the absence of synchronous reasoning leads to high latency and incoherent cognition during real-time interaction. To overcome this, the authors propose the Video Streaming Thinking (VST) paradigm, which enables continuous, synchronized logical reasoning through a “think-while-watching” mechanism during video playback. Key innovations include VST-SFT for structured supervised fine-tuning, VST-RL for multi-turn interactive reinforcement learning, an entity-relation-guided streaming chain-of-thought, and automated question-answer pair generation grounded in video knowledge graphs. Experimental results show that VST-7B achieves 79.5% on StreamingBench and 59.3% on OVO-Bench, responds 15.7× faster than Video-R1, and gains +5.4% on VideoHolmes.
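To make the latency argument concrete, here is a minimal, self-contained sketch of the think-while-watching control flow described above. All names (`ThinkWhileWatching`, `reason_over_clip`, `on_clip`, `on_question`) are illustrative assumptions for exposition, not the released VST API; the point is only that per-clip reasoning happens during playback, so little work remains when a question arrives.

```python
# Toy sketch of a "think-while-watching" loop (illustrative, not the VST code).
from dataclasses import dataclass, field


def reason_over_clip(clip: str) -> str:
    """Placeholder for the model's per-clip reasoning step."""
    return f"observation({clip})"


@dataclass
class ThinkWhileWatching:
    # Running chain-of-thought, grown clip by clip as the stream plays.
    thoughts: list = field(default_factory=list)

    def on_clip(self, clip: str) -> None:
        # Synchronized reasoning: runs DURING playback, so its latency is
        # amortized over the stream rather than paid at question time.
        self.thoughts.append(reason_over_clip(clip))

    def on_question(self, question: str) -> str:
        # At query time most reasoning is already done; only a short
        # final decoding step remains, which keeps responses fast.
        context = " -> ".join(self.thoughts)
        return f"answer to {question!r} grounded in [{context}]"


if __name__ == "__main__":
    model = ThinkWhileWatching()
    for clip in ["clip_0s_4s", "clip_4s_8s", "clip_8s_12s"]:
        model.on_clip(clip)  # think while watching
    print(model.on_question("What happened after the car stopped?"))
```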

๐Ÿ“ Abstract
Online Video Large Language Models (VideoLLMs) play a critical role in supporting responsive, real-time interaction. Existing methods focus on streaming perception, lacking a synchronized logical reasoning stream. However, directly applying test-time scaling methods incurs unacceptable response latency. To address this trade-off, we propose Video Streaming Thinking (VST), a novel paradigm for streaming video understanding. It supports a thinking-while-watching mechanism, which activates reasoning over incoming video clips during streaming. This design improves timely comprehension and coherent cognition while preserving real-time responsiveness by amortizing LLM reasoning latency over video playback. Furthermore, we introduce a comprehensive post-training pipeline that integrates VST-SFT, which structurally adapts the offline VideoLLM to causal streaming reasoning, and VST-RL, which provides end-to-end improvement through self-exploration in a multi-turn video interaction environment. Additionally, we devise an automated training-data synthesis pipeline that uses video knowledge graphs to generate high-quality streaming QA pairs, with an entity-relation grounded streaming Chain-of-Thought to enforce multi-evidence reasoning and sustained attention to the video stream. Extensive evaluations show that VST-7B performs strongly on online benchmarks, e.g., 79.5% on StreamingBench and 59.3% on OVO-Bench. Meanwhile, VST remains competitive on offline long-form and reasoning benchmarks. Compared with Video-R1, VST responds 15.7 times faster and achieves a +5.4% improvement on VideoHolmes, demonstrating higher efficiency and strong generalization across diverse video understanding tasks. Code, data, and models will be released at https://github.com/1ranGuan/VST.
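The abstract's data-synthesis pipeline also lends itself to a small illustration. Under the assumption, taken from the abstract rather than from released code, that each video is annotated as timestamped (subject, relation, object) triples, a toy generator might chain triples that share an entity into a question whose answer requires evidence from several clips. The triples, function name, and output format below are all hypothetical.

```python
# Toy sketch of knowledge-graph-grounded streaming QA synthesis.
# Hypothetical video knowledge graph: (timestamp_sec, subject, relation, object)
triples = [
    (3.0, "man", "picks_up", "phone"),
    (7.5, "man", "walks_to", "door"),
    (12.0, "man", "hands", "phone"),
    (12.0, "woman", "receives", "phone"),
]


def synthesize_multi_evidence_qa(triples, entity):
    """Chain triples sharing an entity into a QA pair whose answer needs
    evidence spread across several timestamps (i.e., several clips)."""
    chain = sorted(t for t in triples if entity in (t[1], t[3]))
    if len(chain) < 2:
        return None  # not enough evidence for a multi-evidence question
    # The question probes the final event; a streaming CoT must have
    # tracked the entity across all earlier clips to answer correctly.
    last = chain[-1]
    question = f"What does the {last[1]} do with the {entity} at the end?"
    answer = f"{last[2].replace('_', ' ')} the {last[3]}"
    evidence = [t[0] for t in chain]  # timestamps the CoT must attend to
    return {"question": question, "answer": answer, "evidence_times": evidence}


if __name__ == "__main__":
    print(synthesize_multi_evidence_qa(triples, "phone"))
```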
Problem

Research questions and friction points this paper is trying to address.

Video Streaming
Real-time Interaction
Logical Reasoning
Response Latency
Video Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video Streaming Thinking
Streaming Reasoning
Causal Video Understanding
Chain-of-Thought
Reinforcement Learning
Yiran Guan
Huazhong University of Science and Technology
Liang Yin
Huazhong University of Science and Technology
Dingkang Liang
Huazhong University of Science and Technology
Embodied AI, World Model, Autonomous Driving, Crowd Counting
Jianzhong Ju
MiLM Plus, Xiaomi Inc.
Zhenbo Luo
XiaoMi
Vision Language Model, Computer Vision
Jian Luan
Toshiba, Microsoft, Xiaomi
LLM, VLM, TTS, Singing Synthesis
Yuliang Liu
Huazhong University of Science and Technology
Xiang Bai
Huazhong University of Science and Technology (HUST)
Computer Vision, OCR