MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance

📅 2025-10-01
🤖 AI Summary
Existing spoken dialogue systems rely on cascaded pipelines or text-based intermediate representations, which discard paralinguistic cues and constrain expressivity. To address this, the paper proposes the first truly end-to-end speech-to-speech large language model, which eliminates textual intermediates and models speech understanding and generation directly. Architecturally, a frozen, pretrained text language model forms the lower layers and preserves textual capabilities, while dedicated upper layers handle speech encoding and decoding; a modality-based layer-splitting strategy enables joint learning over the speech and text channels. The model achieves state-of-the-art performance on spoken question answering, matches the speech-to-speech output quality of text-guided systems, and retains strong text-processing ability, thereby breaking the long-standing text bottleneck in spoken interaction.

📝 Abstract
Spoken dialogue systems often rely on cascaded pipelines that transcribe, process, and resynthesize speech. While effective, this design discards paralinguistic cues and limits expressivity. Recent end-to-end methods reduce latency and better preserve these cues, yet still rely on text intermediates, creating a fundamental bottleneck. We present MOSS-Speech, a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our approach combines a modality-based layer-splitting architecture with a frozen pre-training strategy, preserving the reasoning and knowledge of pretrained text LLMs while adding native speech capabilities. Experiments show that our model achieves state-of-the-art results in spoken question answering and delivers comparable speech-to-speech performance relative to existing text-guided systems, while still maintaining competitive text performance. By narrowing the gap between text-guided and direct speech generation, our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
Problem

Research questions and friction points this paper is trying to address.

Eliminating text intermediaries in speech-to-speech systems
Preserving paralinguistic cues lost in cascaded pipelines
Maintaining text reasoning while enabling direct speech generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct speech-to-speech model without text guidance
Modality-based layer-splitting with frozen pretraining strategy
Preserves text LLM reasoning while adding speech capabilities
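The layer-splitting idea above can be sketched in a few lines. This is a hypothetical toy illustration, not the authors' code: a shared frozen bottom stack (standing in for the pretrained text LLM) processes every token representation, then routing sends it through modality-specific top layers for text or speech. All names (`ModalitySplitLM`, `make_layer`) and the elementwise "layers" are invented stand-ins for transformer blocks.

```python
# Hypothetical sketch of modality-based layer-splitting (not the paper's code):
# a frozen shared bottom stack, plus separate top stacks per modality.
from dataclasses import dataclass
from typing import Callable, List

Layer = Callable[[List[float]], List[float]]

def make_layer(scale: float) -> Layer:
    # Stand-in for a transformer block: a simple elementwise transform.
    return lambda h: [scale * x + 0.1 for x in h]

@dataclass
class ModalitySplitLM:
    shared_bottom: List[Layer]   # frozen pretrained text-LLM layers
    text_top: List[Layer]        # text-specific upper layers
    speech_top: List[Layer]      # speech-specific upper layers (newly trained)

    def forward(self, hidden: List[float], modality: str) -> List[float]:
        for layer in self.shared_bottom:   # weights kept frozen in training
            hidden = layer(hidden)
        # Route to the modality-specific upper stack.
        top = self.speech_top if modality == "speech" else self.text_top
        for layer in top:
            hidden = layer(hidden)
        return hidden

model = ModalitySplitLM(
    shared_bottom=[make_layer(1.0), make_layer(1.0)],
    text_top=[make_layer(0.5)],
    speech_top=[make_layer(2.0)],
)

h = [0.2, 0.4]
print(model.forward(h, "text"))    # text path through shared bottom + text top
print(model.forward(h, "speech"))  # same bottom, speech-specific top
```

The point of the split is visible even in this toy: both modalities share the frozen bottom computation (preserving text-LLM knowledge), while only the upper stacks diverge per modality.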
👥 Authors
Xingjian Zhao, Zhe Xu, Luozhijie Jin, Yang Wang, Hanfu Chen, Yaozhou Jiang, Ke Chen, Ruixiao Li, Mingshu Chen, Ruiming Wang, Wenbo Zhang, Yiyang Zhang, Donghua Yu, Yang Gao, Xiaogui Yang, Yitian Gong, Yuanfan Xu¹, Qinyuan Cheng, Zhaoye Fei², Shimin Li², Yaqian Zhou, Xuanjing Huang, Xipeng Qiu
Unless noted, authors are with the SLM Team, Shanghai Innovation Institute, Fudan University, and MOSI. ¹ Tsinghua University. ² Fudan University.