Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models

📅 2026-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation of existing large vision-language models, which rely on full video inputs and struggle with real-world streaming scenarios where frames arrive sequentially. To enable low-latency real-time inference, the authors propose the first truly concurrent streaming video reasoning framework by decoupling visual encoding from language generation. Key innovations include temporally aligned reasoning units, streaming-aware attention masks, dynamic positional encoding, and a dual KV-cache architecture, further enhanced by parallelized chain-of-thought generation and streaming-constrained training. Experiments on Qwen2.5-VL demonstrate that the proposed method significantly outperforms batched and interleaved baselines, achieving substantial reductions in both time-to-first-token (TTFT) and overall inference latency.
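One way to picture the "streaming-aware attention masks" mentioned in the summary is a mask in which each reasoning token may attend only to visual or textual tokens that had already arrived by the time it was generated. The sketch below is a minimal illustration under that assumption; the function name and interface are hypothetical and not taken from the paper's released code.

```python
import numpy as np

def streaming_attention_mask(key_times, query_times):
    """Build a streaming-aware attention mask.

    key_times:   arrival timestamps (e.g. frame indices) of the key tokens.
    query_times: arrival timestamps of the query (reasoning) tokens.

    Returns a boolean matrix where mask[i, j] is True iff query token i
    may attend to key token j, i.e. key j arrived no later than query i.
    """
    key_times = np.asarray(key_times)
    query_times = np.asarray(query_times)
    # Broadcast the comparison: rows are queries, columns are keys.
    return query_times[:, None] >= key_times[None, :]
```

For example, with visual tokens arriving at times `[0, 0, 1, 2]` and reasoning tokens emitted at times `[0, 1, 2]`, the first reasoning token can only see the two tokens from frame 0, while the last one sees everything.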

📝 Abstract
Large Vision Language Models (LVLMs) exhibit strong Chain-of-Thought (CoT) capabilities, yet most existing paradigms assume full-video availability before inference, a batch-style process misaligned with real-world video streams where information arrives sequentially. Motivated by the streaming nature of video data, we investigate two streaming reasoning paradigms for LVLMs. The first, an interleaved paradigm, alternates between receiving frames and producing partial reasoning but remains constrained by strictly ordered cache updates. To better match streaming inputs, we propose \textbf{Think-as-You-See (TaYS)}, a unified framework enabling true concurrent reasoning. TaYS integrates parallelized CoT generation, stream-constrained training, and stream-parallel inference. It further employs temporally aligned reasoning units, streaming attention masks and positional encodings, and a dual KV-cache that decouples visual encoding from textual reasoning. We evaluate all paradigms on the Qwen2.5-VL family across representative video CoT tasks, including event dynamics analysis, causal reasoning, and thematic understanding. Experiments show that TaYS consistently outperforms both batch and interleaved baselines, improving reasoning performance while substantially reducing time-to-first-token (TTFT) and overall reasoning delay. These results demonstrate the effectiveness of data-aligned streaming reasoning in enabling efficient and responsive video understanding for LVLMs. We release our code at \href{https://github.com/EIT-NLP/StreamingLLM/tree/main/TaYS}{this repository.}
Problem

Research questions and friction points this paper is trying to address.

streaming reasoning
vision-language models
chain-of-thought
video understanding
real-time inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

streaming reasoning
Chain-of-Thought
vision-language models
concurrent inference
dual KV-cache
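The "dual KV-cache" idea, as described in the abstract, decouples visual encoding from textual reasoning. A minimal sketch of that separation is shown below; the class and method names are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class DualKVCache:
    """Illustrative dual KV-cache: visual and textual key/value entries live
    in separate stores, so the vision encoder can keep appending entries for
    newly arrived frames while the language model independently extends the
    textual cache during chain-of-thought generation."""
    visual: list = field(default_factory=list)
    textual: list = field(default_factory=list)

    def append_visual(self, kv):
        # Called as each streamed frame is encoded.
        self.visual.append(kv)

    def append_textual(self, kv):
        # Called as each reasoning token is decoded.
        self.textual.append(kv)

    def snapshot(self):
        # A decoding step attends over a read-only view of both caches,
        # so concurrent visual appends do not mutate the attention context.
        return tuple(self.visual), tuple(self.textual)
```

Keeping the two stores separate is what allows frame ingestion and token generation to proceed concurrently instead of through strictly ordered cache updates, which the abstract identifies as the bottleneck of the interleaved paradigm.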
Jialiang Zhang
Institute of Digital Twin, Eastern Institute of Technology, Ningbo; Ocean University of China
Junlong Tong
Institute of Digital Twin, Eastern Institute of Technology, Ningbo; Shanghai Jiao Tong University
Junyan Lin
Institute of Digital Twin, Eastern Institute of Technology, Ningbo; The Hong Kong Polytechnic University
Hao Wu
Institute of Digital Twin, Eastern Institute of Technology, Ningbo
Yirong Sun
Institute of Digital Twin, Eastern Institute of Technology, Ningbo; The University of Nottingham Ningbo China
Yunpu Ma
Ludwig Maximilian University of Munich
Foundation Models · Agentic AI · Temporal Knowledge Graph · Quantum AI
Xiaoyu Shen
Eastern Institute of Technology, Ningbo
language model · multi-modal learning · reasoning