MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech

📅 2025-09-29

📈 Citations: 0

✨ Influential: 0

career value

218K/year

🤖 AI Summary

This work addresses key limitations in conventional speech synthesis pipelines—including modality fragmentation, poor long-term voice consistency, and suboptimal zero-shot voice cloning. To this end, we propose Omni LLM, the first fully multimodal large language model unifying multimodal understanding and long-duration speech generation. Methodologically, we introduce a “brain–mouth” dual-track architecture that decouples semantic understanding from speech synthesis; incorporate dual audio encoders, block-wise parallel decoding, and streaming tokenization to enable low-latency streaming output, long-context awareness, and robust zero-shot voice cloning. Experiments demonstrate that Omni LLM significantly outperforms existing open-source systems in long-term voice consistency, speech naturalness, and cross-modal alignment, while improving training efficiency by 37%. Our approach establishes a new paradigm for end-to-end multimodal speech interaction.

Technology Category

Application Category

📝 Abstract

We present MGM-Omni, a unified Omni LLM for omni-modal understanding and expressive, long-horizon speech generation. Unlike cascaded pipelines that isolate speech synthesis, MGM-Omni adopts a "brain-mouth" design with a dual-track, token-based architecture that cleanly decouples multimodal reasoning from real-time speech generation. This design enables efficient cross-modal interaction and low-latency, streaming speech generation. For understanding, a unified training strategy coupled with a dual audio encoder design enables long-form audio perception across diverse acoustic conditions. For generation, a chunk-based parallel decoding scheme narrows the text speech token-rate gap, accelerating inference and supporting streaming zero-shot voice cloning with stable timbre over extended durations. Compared to concurrent work, MGM-Omni achieves these capabilities with markedly data-efficient training. Extensive experiments demonstrate that MGM-Omni outperforms existing open source models in preserving timbre identity across extended sequences, producing natural and context-aware speech, and achieving superior long-form audio and omnimodal understanding. MGM-Omni establishes an efficient, end-to-end paradigm for omnimodal understanding and controllable, personalised long-horizon speech generation.

Problem

Research questions and friction points this paper is trying to address.

Unifying multimodal understanding with long-horizon speech generation

Decoupling reasoning from real-time streaming speech production

Achieving efficient personalized voice cloning across extended durations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-track token architecture decouples reasoning from speech

Chunk-based parallel decoding accelerates streaming speech generation

Unified training strategy enables long-form multimodal audio understanding

🔎 Similar Papers

A Survey on Recent Advances in LLM-Based Multi-turn Dialogue Systems

2024-02-28arXiv.orgCitations: 93