Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models

📅 2026-03-12
🤖 AI Summary
Existing vision-language models are limited in multi-turn interactive video reasoning due to sequential perception-generation pipelines and long-term memory decay. This work proposes a segment-level, memory-anchored streaming reasoning framework that enables parallel "watching-and-thinking" processing, enforces strict causality through segment-level causal masking and streaming positional encoding, and enhances reasoning via a three-stage chain-of-thought dataset and a stage-aligned training strategy. Implemented on Qwen3-VL, the framework achieves absolute accuracy gains of 2.6% on StreamingBench and 3.79% on OVO-Bench in single-turn settings. In multi-turn scenarios, it reduces output tokens by 56% while maintaining stable performance.

📝 Abstract
Multimodal large language models (MLLMs) have shown strong performance on offline video understanding, but most are limited to offline inference or have weak online reasoning, making multi-turn interaction over continuously arriving video streams difficult. Existing streaming methods typically use an interleaved perception-generation paradigm, which prevents concurrent perception and generation and leads to early memory decay as streams grow, hurting long-range dependency modeling. We propose Think While Watching, a memory-anchored streaming video reasoning framework that preserves continuous segment-level memory during multi-turn interaction. We build a three-stage, multi-round chain-of-thought dataset and adopt a stage-matched training strategy, while enforcing strict causality through a segment-level streaming causal mask and streaming positional encoding. During inference, we introduce an efficient pipeline that overlaps watching and thinking and adaptively selects the best attention backend. Under both single-round and multi-round streaming input protocols, our method achieves strong results. Built on Qwen3-VL, it improves single-round accuracy by 2.6% on StreamingBench and by 3.79% on OVO-Bench. In the multi-round setting, it maintains performance while reducing output tokens by 56%. Code is available at: https://github.com/wl666hhh/Think_While_Watching/
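The segment-level streaming causal mask described in the abstract can be sketched as follows. This is a minimal illustration under our own assumptions (the `segment_ids` input and the boolean-mask convention are ours, not the paper's implementation): each token carries the id of the stream segment it belongs to, and a token may attend to any token in its own segment or an earlier one, but never to a future segment.

```python
import numpy as np

def segment_causal_mask(segment_ids):
    """Build a segment-level causal attention mask (True = may attend).

    Hypothetical sketch: tokens attend freely within their own segment
    and to all earlier segments, enforcing causality at segment
    granularity rather than per token.
    """
    seg = np.asarray(segment_ids)
    # mask[i, j] is True iff token j's segment is not later than token i's
    return seg[None, :] <= seg[:, None]

# Two video segments (ids 0 and 1) followed by a query token in segment 2
mask = segment_causal_mask([0, 0, 1, 1, 2])
```

A mask like this would be passed to the attention layer so that, as new segments stream in, earlier segments remain visible as memory while future content stays hidden.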
Problem

Research questions and friction points this paper is trying to address.

online streaming
multi-turn video reasoning
segment-level memory
multimodal large language models
long-range dependency
Innovation

Methods, ideas, or system contributions that make the work stand out.

streaming video reasoning
segment-level memory
multimodal large language models
causal masking
online inference
Lu Wang
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Zhuoran Jin
Institute of Automation, Chinese Academy of Sciences
Large Language Models · Natural Language Processing · Knowledge Engineering
Yupu Hao
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Yubo Chen
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing · Information Extraction · Event Extraction · Large Language Model
Kang Liu
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Yulong Ao
Beijing Academy of Artificial Intelligence (BAAI), Beijing, China
Jun Zhao
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China