Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model

📅 2025-06-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large multimodal models (LMMs) typically rely on sequence-dimension concatenation and massive cross-modal data to learn alignment, which makes interaction inefficient and inflexible. To address this, we propose a semantics-driven differential alignment strategy: vision-text alignment is performed via sequence-dimension concatenation, while speech-text alignment employs a CTC-guided layer-dimension mapping for lightweight cross-modal transfer. This layer-dimension speech-text alignment decouples the fusion strategies of the two modalities and allows the model to emit intermediate text outputs (ASR transcriptions and the model's text response) in real time during speech interaction. Built upon a large language model backbone, the architecture supports simultaneous text-vision-speech interaction via joint multimodal training and inference. Experiments demonstrate strong performance on visual understanding, speech interaction, and vision-grounded speech interaction tasks, while significantly reducing speech-data requirements and enabling low-latency real-time multimodal interaction.
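
As a minimal sketch of the two alignment routes described above (tensor shapes and module names are illustrative assumptions, not the paper's implementation):

```python
# Sketch of the semantics-driven differential alignment strategy.
# All shapes and module names here are illustrative assumptions.
import torch

batch, d_model = 2, 512
vision_tokens = torch.randn(batch, 196, d_model)  # e.g. ViT patch features
text_tokens   = torch.randn(batch, 32,  d_model)  # embedded prompt tokens

# Vision-text alignment: vision is semantically complementary to text,
# so vision tokens are concatenated along the *sequence* dimension
# before entering the LLM backbone.
llm_input = torch.cat([vision_tokens, text_tokens], dim=1)  # (2, 228, 512)

# Speech-text alignment: speech is semantically consistent with text,
# so instead of lengthening the sequence, speech representations are
# mapped into the text space at the *layer* dimension; a CTC objective
# can then tie these states to text tokens (sketched after the abstract).
speech_frames  = torch.randn(batch, 400, d_model)   # speech encoder output
speech_to_text = torch.nn.Linear(d_model, d_model)  # layer-wise mapping (illustrative)
aligned_speech = speech_to_text(speech_frames)      # same length, text-aligned space
```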

📝 Abstract
The emergence of GPT-4o-like large multimodal models (LMMs) has spurred the exploration of integrating text, vision, and speech modalities to support more flexible multimodal interaction. Existing LMMs typically concatenate representations of modalities along the sequence dimension and feed them into a large language model (LLM) backbone. While sequence-dimension concatenation is straightforward for modality integration, it often relies heavily on large-scale data to learn modality alignments. In this paper, we aim to model the relationships between modalities more purposefully, thereby achieving more efficient and flexible modality alignments. To this end, we propose Stream-Omni, a large language-vision-speech model with efficient modality alignments, which can simultaneously support interactions under various modality combinations. Stream-Omni employs an LLM as the backbone and aligns the vision and speech modalities to text based on their relationships. For vision, which is semantically complementary to text, Stream-Omni uses sequence-dimension concatenation to achieve vision-text alignment. For speech, which is semantically consistent with text, Stream-Omni introduces a CTC-based layer-dimension mapping to achieve speech-text alignment. In this way, Stream-Omni can achieve modality alignments with less data (especially speech), enabling the transfer of text capabilities to other modalities. Experiments on various benchmarks demonstrate that Stream-Omni achieves strong performance on visual understanding, speech interaction, and vision-grounded speech interaction tasks. Owing to the layer-dimension mapping, Stream-Omni can simultaneously provide intermediate text outputs (such as ASR transcriptions and model responses) during speech interaction, offering users a comprehensive multimodal experience.
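
To make the CTC-based layer-dimension mapping concrete, the sketch below ties speech-layer hidden states to the text token sequence with PyTorch's standard nn.CTCLoss. Vocabulary size, sequence lengths, tensor names, and the choice of which layer carries speech are illustrative assumptions, not the paper's actual configuration.

```python
# Hedged sketch: CTC objective aligning speech-layer hidden states with
# text tokens. Sizes and names are assumptions for illustration only.
import torch
import torch.nn as nn

vocab_size, blank_id = 32000, 0
d_model, T, U, batch = 512, 400, 32, 2

hidden = torch.randn(T, batch, d_model)        # speech-layer states, (T, N, C)
ctc_head = nn.Linear(d_model, vocab_size)      # projects states to the vocabulary
log_probs = ctc_head(hidden).log_softmax(dim=-1)

targets = torch.randint(1, vocab_size, (batch, U))   # text token ids (non-blank)
input_lengths  = torch.full((batch,), T, dtype=torch.long)
target_lengths = torch.full((batch,), U, dtype=torch.long)

# CTC marginalizes over all monotonic speech-to-text alignments, so the
# model learns where each text token "lives" in the speech stream; greedy
# decoding of log_probs then yields an intermediate ASR transcription.
ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```

Because CTC marginalizes over monotonic alignments, training needs only (speech, transcript) pairs rather than densely aligned data, which is consistent with the abstract's claim that alignment can be learned with less speech data.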
Problem

Research questions and friction points this paper is trying to address.

Efficient alignment of text, vision, and speech modalities
Simultaneous interaction under various modality combinations
Reducing data dependency for modality alignment learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM backbone aligns vision and speech to text
Sequence-dimension concatenation for vision-text alignment
CTC-based layer-dimension mapping for speech-text alignment (see the decoding sketch after this list)
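
Following the forward reference in the last bullet: a standard greedy CTC decode (argmax per frame, collapse repeats, drop blanks) is one way such intermediate ASR transcriptions can be surfaced during speech interaction. This is generic CTC practice, not necessarily the paper's exact decoding path.

```python
# Illustrative greedy CTC decode: collapse repeated ids, drop blanks.
import torch

def greedy_ctc_decode(log_probs: torch.Tensor, blank_id: int = 0) -> list[int]:
    """log_probs: (T, vocab) frame-level log-probabilities for one utterance."""
    ids = log_probs.argmax(dim=-1).tolist()
    out, prev = [], blank_id
    for i in ids:
        if i != prev and i != blank_id:
            out.append(i)   # keep only new, non-blank token ids
        prev = i
    return out
```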
Shaolei Zhang
Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)
Natural Language Processing · Large Language Model · Multimodal LLMs · Simultaneous Translation
Shoutao Guo
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS); University of Chinese Academy of Sciences, Beijing, China
Qingkai Fang
Institute of Computing Technology, Chinese Academy of Sciences
Large Language Models · Speech Language Models · Multimodal LLMs · Speech Translation
Yan Zhou
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS); University of Chinese Academy of Sciences, Beijing, China
Yang Feng
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS); Key Laboratory of AI Safety, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China