🤖 AI Summary
Existing spoken dialogue systems rely on cascaded pipelines or text-based intermediaries, which discard paralinguistic cues and constrain expressivity. To address this, we propose the first truly end-to-end speech-to-speech large language model, which eliminates textual intermediate representations and directly models speech understanding and generation. Methodologically, we design a modality-hierarchical architecture: a frozen, pre-trained language model at the bottom preserves textual capabilities, while upper layers are dedicated to speech encoding and decoding; crucially, a modality-separate layer-splitting strategy enables joint learning over the speech and text channels. On spoken question-answering tasks, our model achieves state-of-the-art performance; its speech-to-speech output quality matches that of text-guided systems while retaining strong text processing, thereby breaking the long-standing text bottleneck in spoken interaction.
📝 Abstract
Spoken dialogue systems often rely on cascaded pipelines that transcribe, process, and resynthesize speech. While effective, this design discards paralinguistic cues and limits expressivity. Recent end-to-end methods reduce latency and better preserve these cues, yet still rely on text intermediates, creating a fundamental bottleneck. We present MOSS-Speech, a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our approach combines a modality-based layer-splitting architecture with a frozen pre-training strategy, preserving the reasoning and knowledge of pretrained text LLMs while adding native speech capabilities. Experiments show that our model achieves state-of-the-art results in spoken question answering and delivers speech-to-speech performance comparable to that of existing text-guided systems, while still maintaining competitive text performance. By narrowing the gap between text-guided and direct speech generation, our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
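The layer-splitting idea described above can be sketched in miniature: a shared, frozen lower stack inherited from a pretrained text LLM, topped by separate trainable stacks per modality. All class names, layer counts, and the stand-in `linear` layers below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of modality-based layer splitting.
# Real transformer layers are replaced by a scalar "linear" stand-in.
from dataclasses import dataclass, field
from typing import Dict, List

Vector = List[float]

def linear(x: Vector, scale: float) -> Vector:
    # Stand-in for one transformer layer.
    return [v * scale for v in x]

@dataclass
class FrozenTextBackbone:
    """Lower layers: pretrained text LLM, frozen to preserve its
    reasoning and knowledge while speech branches are trained."""
    n_layers: int = 4

    def forward(self, h: Vector) -> Vector:
        for _ in range(self.n_layers):
            h = linear(h, 1.0)  # frozen weights: never updated
        return h

@dataclass
class ModalityBranch:
    """Upper layers: one trainable stack per modality."""
    scale: float
    n_layers: int = 2

    def forward(self, h: Vector) -> Vector:
        for _ in range(self.n_layers):
            h = linear(h, self.scale)
        return h

@dataclass
class LayerSplitLM:
    backbone: FrozenTextBackbone = field(default_factory=FrozenTextBackbone)
    branches: Dict[str, ModalityBranch] = field(default_factory=lambda: {
        "text": ModalityBranch(scale=1.0),
        "speech": ModalityBranch(scale=0.5),
    })

    def forward(self, h: Vector, modality: str) -> Vector:
        # Shared frozen lower stack, then route to the modality-specific
        # upper stack: no text intermediate on the speech path.
        h = self.backbone.forward(h)
        return self.branches[modality].forward(h)

model = LayerSplitLM()
print(model.forward([1.0, 2.0], "speech"))  # → [0.25, 0.5]
print(model.forward([1.0, 2.0], "text"))    # → [1.0, 2.0]
```

The design choice this illustrates is the separation of concerns: textual competence lives entirely in frozen shared weights, while each modality's generation path is learned in its own upper stack.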