🤖 AI Summary
Existing spoken dialogue systems rely on cascaded pipelines or text-based intermediaries, which discard paralinguistic cues and constrain expressivity. To address this, we propose the first truly end-to-end speech-to-speech large language model, which eliminates textual intermediate representations and directly models speech understanding and generation. Methodologically, we design a modality-hierarchical architecture: a frozen, pre-trained language model at the bottom preserves textual capabilities, while upper layers are dedicated to speech encoding and decoding; crucially, a modality-separate layer-splitting strategy enables joint learning over the speech and text channels. On spoken question-answering tasks, our model achieves state-of-the-art performance; its speech-to-speech output quality matches that of text-guided systems while retaining strong text processing, thereby breaking the long-standing text bottleneck in spoken interaction.
📝 Abstract
Spoken dialogue systems often rely on cascaded pipelines that transcribe, process, and resynthesize speech. While effective, this design discards paralinguistic cues and limits expressivity. Recent end-to-end methods reduce latency and better preserve these cues, yet still rely on text intermediates, creating a fundamental bottleneck. We present MOSS-Speech, a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our approach combines a modality-based layer-splitting architecture with a frozen pre-training strategy, preserving the reasoning and knowledge of pretrained text LLMs while adding native speech capabilities. Experiments show that our model achieves state-of-the-art results in spoken question answering and delivers speech-to-speech performance comparable to that of existing text-guided systems, while still maintaining competitive text performance. By narrowing the gap between text-guided and direct speech generation, our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
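The layer-splitting idea described above can be sketched in miniature: a shared, frozen lower stack inherited from a pretrained text LLM, topped by separate trainable stacks per modality. All class names, layer counts, and the stand-in `linear` layers below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of modality-based layer splitting.
# Real transformer layers are replaced by a scalar "linear" stand-in.
from dataclasses import dataclass, field
from typing import Dict, List

Vector = List[float]

def linear(x: Vector, scale: float) -> Vector:
    # Stand-in for one transformer layer.
    return [v * scale for v in x]

@dataclass
class FrozenTextBackbone:
    """Lower layers: pretrained text LLM, frozen to preserve its
    reasoning and knowledge while speech branches are trained."""
    n_layers: int = 4

    def forward(self, h: Vector) -> Vector:
        for _ in range(self.n_layers):
            h = linear(h, 1.0)  # frozen weights: never updated
        return h

@dataclass
class ModalityBranch:
    """Upper layers: one trainable stack per modality."""
    scale: float
    n_layers: int = 2

    def forward(self, h: Vector) -> Vector:
        for _ in range(self.n_layers):
            h = linear(h, self.scale)
        return h

@dataclass
class LayerSplitLM:
    backbone: FrozenTextBackbone = field(default_factory=FrozenTextBackbone)
    branches: Dict[str, ModalityBranch] = field(default_factory=lambda: {
        "text": ModalityBranch(scale=1.0),
        "speech": ModalityBranch(scale=0.5),
    })

    def forward(self, h: Vector, modality: str) -> Vector:
        # Shared frozen lower stack, then route to the modality-specific
        # upper stack: no text intermediate on the speech path.
        h = self.backbone.forward(h)
        return self.branches[modality].forward(h)

model = LayerSplitLM()
print(model.forward([1.0, 2.0], "speech"))  # → [0.25, 0.5]
print(model.forward([1.0, 2.0], "text"))    # → [1.0, 2.0]
```

The design choice this illustrates is the separation of concerns: textual competence lives entirely in frozen shared weights, while each modality's generation path is learned in its own upper stack.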