Speak While Watching: Unleashing TRUE Real-Time Video Understanding Capability of Multimodal Large Language Models

📅 2026-01-11
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work targets a core limitation of existing vision-language models in real-time video understanding: serial dependencies between perception and generation prevent true input-output parallelism. To overcome this, the authors propose a parallel streaming framework that decouples perception from generation by relaxing the global positional-continuity constraint of conventional positional encodings. Three designs are introduced (Overlapped, Group-Decoupled, and Gap-Isolated), enabling multimodal large language models, for the first time, to generate responses while still processing incoming video. Under balanced computational loads, the framework achieves up to a 2× inference speedup while maintaining high accuracy and fluent generation.

📝 Abstract
Multimodal Large Language Models (MLLMs) have achieved strong performance across many tasks, yet most systems remain limited to offline inference, requiring complete inputs before generating outputs. Recent streaming methods reduce latency by interleaving perception and generation, but still enforce a sequential perception-generation cycle, limiting real-time interaction. In this work, we target a fundamental bottleneck that arises when extending MLLMs to real-time video understanding: the global positional continuity constraint imposed by standard positional encoding schemes. While natural in offline inference, this constraint tightly couples perception and generation, preventing effective input-output parallelism. To address this limitation, we propose a parallel streaming framework that relaxes positional continuity through three designs: Overlapped, Group-Decoupled, and Gap-Isolated. These designs enable simultaneous perception and generation, allowing the model to process incoming inputs while producing responses in real time. Extensive experiments reveal that Group-Decoupled achieves the best efficiency-performance balance, maintaining high fluency and accuracy while significantly reducing latency. We further show that the proposed framework yields up to 2x acceleration under balanced perception-generation workloads, establishing a principled pathway toward speak-while-watching real-time systems. We make all our code publicly available: https://github.com/EIT-NLP/Speak-While-Watching.
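Based only on the abstract's description, the following is a minimal, hypothetical sketch of what "group-decoupled" position assignment could look like: perception (video) tokens and generation (text) tokens each advance their own position counter, so neither stream has to wait for a single globally continuous index. The function and event format are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of group-decoupled position IDs: each token group
# (perception vs. generation) draws positions from its own counter, so
# the two streams can interleave freely in arrival order.

def assign_group_decoupled_positions(events):
    """events: list of ('perception' | 'generation', token) in arrival order.
    Returns (token, group, position) triples with per-group counters."""
    counters = {"perception": 0, "generation": 0}
    out = []
    for group, token in events:
        out.append((token, group, counters[group]))
        counters[group] += 1
    return out

# Interleaved arrival: frames keep streaming in while the model speaks.
events = [
    ("perception", "frame0"),
    ("perception", "frame1"),
    ("generation", "The"),
    ("perception", "frame2"),   # arrives while generation is underway
    ("generation", "cat"),
]
print(assign_group_decoupled_positions(events))
# Each stream counts independently: frame2 gets perception position 2,
# while "cat" gets generation position 1.
```

Under this (assumed) scheme there is no requirement that a generation token's position follow the last perception token's, which is the kind of continuity constraint the abstract says must be relaxed for speak-while-watching parallelism.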
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
real-time video understanding
positional encoding
streaming inference
input-output parallelism
Innovation

Methods, ideas, or system contributions that make the work stand out.

parallel streaming
real-time video understanding
multimodal large language models
positional encoding decoupling
Group-Decoupled
👥 Authors
Junyan Lin
Department of Computing, The Hong Kong Polytechnic University
Junlong Tong
Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, EIT
Hao Wu
Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, EIT
Jialiang Zhang
Ocean University of China
Jinming Liu
Shanghai Jiao Tong University
VLM, LLM, Computer Vision, Image/Video Compression
Xin Jin
Assistant Professor, Eastern Institute of Technology, Ningbo, China (previously NUS, USTC)
Intelligent Coding, Computer Vision
Xiaoyu Shen
Eastern Institute of Technology, Ningbo
language model, multi-modal learning, reasoning