MinMo: A Multimodal Large Language Model for Seamless Voice Interaction

📅 2025-01-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current speech-text multimodal models suffer from mismatched processing speeds across modalities, low learning efficiency, limited training data, and coarse-grained modeling, which hinder natural, fluent full-duplex human-machine speech interaction. To address these challenges, the authors propose MinMo, an ~8B-parameter multimodal large language model trained through a four-stage alignment paradigm: speech-to-text, text-to-speech, speech-to-speech, and duplex interaction alignment. MinMo incorporates a lightweight voice decoder and supports instruction-driven fine-grained voice control, including emotion, dialect, speaking rate, and timbre. Trained on 1.4 million hours of speech data, it combines end-to-end joint modeling, instruction tuning, and a low-latency streaming inference architecture. MinMo achieves state-of-the-art performance in both speech understanding and generation, with a speech-to-text latency of about 100 ms and a full-duplex latency of about 600 ms in theory and 800 ms in practice, while significantly improving instruction following and voice controllability.

📝 Abstract
Recent advancements in large language models (LLMs) and multimodal speech-text models have laid the groundwork for seamless voice interactions, enabling real-time, natural, and human-like conversations. Previous models for voice interactions are categorized as native and aligned. Native models integrate speech and text processing in one framework but struggle with issues like differing sequence lengths and insufficient pre-training. Aligned models maintain text LLM capabilities but are often limited by small datasets and a narrow focus on speech tasks. In this work, we introduce MinMo, a Multimodal Large Language Model with approximately 8B parameters for seamless voice interaction. We address the main limitations of prior aligned multimodal models. We train MinMo through multiple stages of speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment, on 1.4 million hours of diverse speech data and a broad range of speech tasks. After the multi-stage training, MinMo achieves state-of-the-art performance across various benchmarks for voice comprehension and generation while maintaining the capabilities of text LLMs, and also facilitates full-duplex conversation, that is, simultaneous two-way communication between the user and the system. Moreover, we propose a novel and simple voice decoder that outperforms prior models in voice generation. The enhanced instruction-following capabilities of MinMo support controlling speech generation based on user instructions, with various nuances including emotions, dialects, and speaking rates, and mimicking specific voices. For MinMo, the speech-to-text latency is approximately 100 ms, and the full-duplex latency is approximately 600 ms in theory and 800 ms in practice. The MinMo project web page is https://funaudiollm.github.io/minmo, and the code and models will be released soon.
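As a rough sketch, the four alignment stages named in the abstract can be read as a staged training schedule in which each stage unfreezes a different slice of the model. The module names (`speech_encoder`, `voice_decoder`, `duplex_predictor`, etc.) and the `run_schedule` helper below are illustrative assumptions, not the authors' released implementation:

```python
# Illustrative sketch of MinMo's four-stage alignment schedule.
# Stage names follow the abstract; the trainable-module choices and the
# run_schedule/train_fn structure are assumptions for illustration only.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Stage:
    name: str                    # alignment direction trained in this stage
    trainable: Tuple[str, ...]   # modules assumed to be updated (hypothetical)


SCHEDULE: List[Stage] = [
    Stage("speech-to-text alignment", ("speech_encoder", "input_adaptor")),
    Stage("text-to-speech alignment", ("output_adaptor", "voice_decoder")),
    Stage("speech-to-speech alignment",
          ("input_adaptor", "output_adaptor", "voice_decoder")),
    Stage("duplex interaction alignment", ("duplex_predictor",)),
]


def run_schedule(schedule: List[Stage],
                 train_fn: Callable[[Stage], None]) -> List[str]:
    """Run each alignment stage in order; return the stage names completed."""
    completed = []
    for stage in schedule:
        train_fn(stage)  # one training pass over this stage's data and tasks
        completed.append(stage.name)
    return completed
```

Keeping the text LLM frozen (or lightly tuned) across stages is what lets an aligned model of this kind preserve text capabilities while adding speech understanding, generation, and duplex control.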
Problem

Research questions and friction points this paper is trying to address.

Mismatched processing speeds between the speech and text modalities
Low learning efficiency
Insufficient training data volume
Innovation

Methods, ideas, or system contributions that make the work stand out.

MinMo Model
Bimodal Speech-Text Processing
Low-Latency Bidirectional Dialogue
Qian Chen
Tongyi Lab, Alibaba Group
Yafeng Chen
University of Science and Technology of China
Large Audio Language Model, Speech Signal Processing, Deep Learning
Yanni Chen
Tongyi Lab, Alibaba Group
Mengzhe Chen
Tongyi Lab, Alibaba Group
Yingda Chen
Alibaba Group, Microsoft
Chong Deng
Alibaba Group
Machine Learning, Natural Language Processing
Zhihao Du
Alibaba
Speech Separation, Speech Enhancement, Speaker Diarization
Ruize Gao
Tongyi Lab, Alibaba Group
Changfeng Gao
Tongyi Lab, Alibaba Group
Zhifu Gao
Tongyi Lab, Alibaba Group
Yabin Li
Tongyi Lab, Alibaba Group
Xiang Lv
Tongyi Lab, Alibaba Group
Jiaqing Liu
Renmin University of China
Natural Language Processing, Deep Learning, Machine Learning, Finance
Haoneng Luo
Tongyi Lab, Alibaba Group
Bin Ma
Tongyi Lab, Alibaba Group
Chongjia Ni
Tongyi Lab, Alibaba Group
Xian Shi
Qwen Team, Alibaba
Speech Recognition, Audio LLM, Omni
Jialong Tang
Qwen Team, Alibaba
LLM, NLP
Hui Wang
Tongyi Lab, Alibaba Group
Hao Wang
Tongyi Lab, Alibaba Group
Wen Wang
Tongyi Lab, Alibaba Group
Yuxuan Wang
Tongyi Lab, Alibaba Group
Yunlan Xu
Tongyi Lab, Alibaba Group
Fan Yu
Tongyi Lab, Alibaba Group
Zhijie Yan
Tongyi Lab, Alibaba Group
Yexin Yang
Shanghai Jiao Tong University
Speaker Verification, Speech Processing, Deep Learning, Machine Learning
Baosong Yang
Alibaba Inc.
Machine Learning, Large Language Model, Machine Translation
Xian Yang
University of Manchester
Artificial Intelligence, Machine Learning, Healthcare AI, Natural Language Processing
Guanrou Yang
Shanghai Jiao Tong University
Tianyu Zhao
Tongyi Lab, Alibaba Group
Qinglin Zhang
Tongyi Lab, Alibaba Group
Shiliang Zhang
Department of Computer Science, School of EECS, Peking University
Multimedia Information Retrieval, Multimedia Systems, Visual Search
Nan Zhao
Tongyi Lab, Alibaba Group
Pei Zhang
Tongyi Lab, Alibaba Group
Chong Zhang
Tongyi Lab, Alibaba Group
Jinren Zhou
Tongyi Lab, Alibaba Group