Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing speech-language models predominantly operate in unidirectional, turn-taking paradigms, lacking the ability to interject in real time or respond synchronously. This work introduces the first end-to-end duplex speech-to-speech (S2S) architecture, eliminating the need for pre-trained speech modules and directly modeling concurrent user and agent speech streams. Methodologically, it employs a streaming encoder, separate user/agent modeling, codec-channel fusion, and an LLM-driven duplex generation mechanism. Key contributions include: (1) the first purely end-to-end duplex S2S paradigm; (2) the first publicly released complete training and inference codebase; (3) high-fidelity speech synthesis at an ultra-low bitrate of 0.6 kbps; and (4) drastically reduced data requirements, enabling rapid adaptation to arbitrary LLMs. Experiments demonstrate substantial improvements over prior duplex approaches in interjection latency, turn-taking control accuracy, and speech naturalness.
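The summary reports 0.6 kbps without stating the codec configuration. A back-of-envelope sketch of how such a bitrate could arise from codec parameters (the codebook count, codebook size, and frame rate below are illustrative assumptions, not values from the paper):

```python
# Hedged back-of-envelope for an 0.6 kbps codec bitrate.
# All three parameters are assumptions for illustration; the paper
# only states the resulting bitrate.
n_codebooks = 2      # assumed parallel codec channels per frame
codebook_bits = 10   # assumed 1024-entry codebooks -> 10 bits per token
frame_rate_hz = 30   # assumed codec frames per second

bitrate_bps = n_codebooks * codebook_bits * frame_rate_hz
print(bitrate_bps)   # 600 bps, i.e. 0.6 kbps under these assumed settings
```

Any combination satisfying codebooks × bits × frame rate = 600 would match the reported figure; halving the codebook count relative to prior work is one way to halve the bitrate, as the abstract suggests.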

📝 Abstract
Spoken dialogue is an intuitive form of human-computer interaction, yet current speech language models often remain constrained to turn-based exchanges, lacking real-time adaptability such as user barge-in. We propose a novel duplex speech-to-speech (S2S) architecture featuring continuous user inputs and codec agent outputs with channel fusion that directly models simultaneous user and agent streams. Using a pretrained streaming encoder for user input enables the first duplex S2S model that requires no speech pretraining. Separate architectures for agent and user modeling facilitate codec fine-tuning for better agent voices and halve the bitrate (0.6 kbps) compared to previous works. Experimental results show that the proposed model outperforms previous duplex models in reasoning, turn-taking, and barge-in abilities. The model requires significantly less speech data, as speech pretraining is skipped, which markedly simplifies the process of building a duplex S2S model from any LLM. Finally, it is the first openly available duplex S2S model with training and inference code to foster reproducibility.
Problem

Research questions and friction points this paper is trying to address.

Turn-based speech language models cannot support real-time duplex interaction such as user barge-in
High codec bitrates and generic codecs limit agent voice quality
Dependence on speech pretraining makes building duplex S2S models from LLMs costly and data-hungry
Innovation

Methods, ideas, or system contributions that make the work stand out.

Duplex S2S architecture with channel fusion
Separate user and agent modeling architectures
Pretrained streaming encoder for user input removes the need for speech pretraining
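The contributions above describe a per-frame loop: user features and agent codec channels are fused into one input stream, and the model emits agent codec tokens each frame, yielding the floor when the user barges in. A minimal toy sketch of that control flow (all names and the stand-in LM are assumptions for illustration, not the paper's API):

```python
# Toy sketch of duplex codec-channel fusion with barge-in handling.
# SILENCE, N_CODEBOOKS, fuse_frame, and toy_lm_step are hypothetical
# stand-ins; the real model uses a streaming encoder and an LLM.

SILENCE = 0       # codec token meaning "agent stays quiet"
N_CODEBOOKS = 2   # assumed parallel codec channels fused per frame

def fuse_frame(user_feat, prev_agent_tokens):
    """Fuse one frame: user features + agent codec channels -> LM input."""
    return (tuple(user_feat), tuple(prev_agent_tokens))

def toy_lm_step(fused, user_is_speaking):
    """Stand-in for the duplex LM: listen while the user talks, else speak."""
    if user_is_speaking:
        return [SILENCE] * N_CODEBOOKS   # barge-in: yield the floor
    return [t + 1 for t in fused[1]]     # toy stand-in for token generation

def duplex_stream(user_frames):
    """Run the duplex loop over (user_features, user_is_speaking) frames."""
    agent_tokens = [1] * N_CODEBOOKS
    out = []
    for feat, speaking in user_frames:
        fused = fuse_frame(feat, agent_tokens)
        agent_tokens = toy_lm_step(fused, speaking)
        out.append(list(agent_tokens))
    return out

# Three frames: agent speaks, user barges in, agent resumes.
frames = [([0.1], False), ([0.9], True), ([0.2], False)]
print(duplex_stream(frames))
```

The point of the sketch is the synchrony: both streams advance frame by frame in one loop, so interjection handling is a per-frame decision rather than an explicit turn-taking state machine.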