Speak While Watching: Unleashing TRUE Real-Time Video Understanding Capability of Multimodal Large Language Models

📅 2026-01-11
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work targets a core limitation of existing vision-language models in real-time video understanding: serial dependencies between perception and generation prevent true input-output parallelism. To overcome this, the authors propose a parallel streaming framework that decouples perception from generation by relaxing the global positional-continuity constraint of conventional positional encodings. Three designs are introduced (Overlapped, Group-Decoupled, and Gap-Isolated), enabling multimodal large language models, for the first time, to generate responses while still processing incoming video. Under balanced computational loads, the framework achieves up to a 2× inference speedup while maintaining high accuracy and fluent generation.

📝 Abstract
Multimodal Large Language Models (MLLMs) have achieved strong performance across many tasks, yet most systems remain limited to offline inference, requiring complete inputs before generating outputs. Recent streaming methods reduce latency by interleaving perception and generation, but still enforce a sequential perception-generation cycle, limiting real-time interaction. In this work, we target a fundamental bottleneck that arises when extending MLLMs to real-time video understanding: the global positional continuity constraint imposed by standard positional encoding schemes. While natural in offline inference, this constraint tightly couples perception and generation, preventing effective input-output parallelism. To address this limitation, we propose a parallel streaming framework that relaxes positional continuity through three designs: Overlapped, Group-Decoupled, and Gap-Isolated. These designs enable simultaneous perception and generation, allowing the model to process incoming inputs while producing responses in real time. Extensive experiments reveal that Group-Decoupled achieves the best efficiency-performance balance, maintaining high fluency and accuracy while significantly reducing latency. We further show that the proposed framework yields up to 2x acceleration under balanced perception-generation workloads, establishing a principled pathway toward speak-while-watching real-time systems. We make all our code publicly available: https://github.com/EIT-NLP/Speak-While-Watching.
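Based only on the abstract's description, the following is a minimal, hypothetical sketch of what "group-decoupled" position assignment could look like: perception (video) tokens and generation (text) tokens each advance their own position counter, so neither stream has to wait for a single globally continuous index. The function and event format are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of group-decoupled position IDs: each token group
# (perception vs. generation) draws positions from its own counter, so
# the two streams can interleave freely in arrival order.

def assign_group_decoupled_positions(events):
    """events: list of ('perception' | 'generation', token) in arrival order.
    Returns (token, group, position) triples with per-group counters."""
    counters = {"perception": 0, "generation": 0}
    out = []
    for group, token in events:
        out.append((token, group, counters[group]))
        counters[group] += 1
    return out

# Interleaved arrival: frames keep streaming in while the model speaks.
events = [
    ("perception", "frame0"),
    ("perception", "frame1"),
    ("generation", "The"),
    ("perception", "frame2"),   # arrives while generation is underway
    ("generation", "cat"),
]
print(assign_group_decoupled_positions(events))
# Each stream counts independently: frame2 gets perception position 2,
# while "cat" gets generation position 1.
```

Under this (assumed) scheme there is no requirement that a generation token's position follow the last perception token's, which is the kind of continuity constraint the abstract says must be relaxed for speak-while-watching parallelism.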
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
real-time video understanding
positional encoding
streaming inference
input-output parallelism
Innovation

Methods, ideas, or system contributions that make the work stand out.

parallel streaming
real-time video understanding
multimodal large language models
positional encoding decoupling
Group-Decoupled
👥 Authors
Junyan Lin
Department of Computing, The Hong Kong Polytechnic University
Junlong Tong
Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, EIT
Hao Wu
Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, EIT
Jialiang Zhang
Ocean University of China
Jinming Liu
Shanghai Jiao Tong University
VLM, LLM, Computer Vision, Image/Video Compression
Xin Jin
Assistant Professor, Eastern Institute of Technology, Ningbo, China (previously NUS, USTC)
Intelligent Coding, Computer Vision
Xiaoyu Shen
Eastern Institute of Technology, Ningbo
language model, multi-modal learning, reasoning