Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models

📅 2026-03-20
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the scarcity of high-quality, multi-speaker natural conversation audio data, a key bottleneck in developing full-duplex speech language models. Existing datasets are often limited to single speakers or small scales, and standard preprocessing pipelines are prone to speaker diarization errors and ASR hallucinations. To overcome these challenges, we propose the first open-source, end-to-end scalable preprocessing framework tailored for full-duplex speech language modeling. Our approach integrates robust speaker separation, multi-channel speech alignment, hallucination-resistant ASR post-processing, and dialogue structure modeling. This pipeline substantially mitigates speaker confusion and recognition errors, enabling the generation of high-quality, multi-turn, multi-speaker conversational datasets. The resulting data provides a reliable foundation for training full-duplex models, significantly enhancing interaction naturalness and real-time responsiveness.
πŸ“ Abstract
As the paradigm of AI shifts from text-based LLMs to Speech Language Models (SLMs), there is a growing demand for full-duplex systems capable of real-time, natural human-computer interaction. However, the development of such models is constrained by the scarcity of high-quality, multi-speaker conversational data, as existing large-scale resources are predominantly single-speaker or limited in volume. Addressing the complex dynamics of natural dialogue, such as overlapping speech and back-channeling, remains a challenge, with standard processing pipelines suffering from diarization errors and ASR hallucinations. To bridge this gap, we present a robust and scalable open-source data processing pipeline designed for full-duplex models.
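
To make the terms "overlapping" and "back-channeling" concrete, the following is a minimal illustrative sketch, not the pipeline described in the paper. It assumes diarization output is available as (speaker, start, end) segments. The `Segment` type, the `find_overlaps` and `is_backchannel` helpers, and the 1-second back-channel threshold are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One diarized speech segment (hypothetical format, times in seconds)."""
    speaker: str
    start: float
    end: float


def find_overlaps(segments):
    """Return (a, b, overlap_start, overlap_end) for every pair of
    segments from different speakers that overlap in time."""
    ordered = sorted(segments, key=lambda s: s.start)
    overlaps = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                break  # segments are sorted by start, so no later one overlaps a
            if a.speaker != b.speaker:
                overlaps.append((a, b, b.start, min(a.end, b.end)))
    return overlaps


def is_backchannel(inner, outer, max_duration=1.0):
    """Heuristic (an assumption, not the paper's rule): a short segment
    that lies entirely inside another speaker's turn."""
    return (
        inner.speaker != outer.speaker
        and outer.start <= inner.start
        and inner.end <= outer.end
        and (inner.end - inner.start) <= max_duration
    )


segments = [
    Segment("A", 0.0, 5.0),
    Segment("B", 2.0, 2.5),  # short acknowledgement inside A's turn
    Segment("B", 4.5, 8.0),  # B starts speaking before A finishes
]
overlaps = find_overlaps(segments)
backchannels = [b for a, b, _, _ in overlaps if is_backchannel(b, a)]
```

In this toy input, both of B's segments overlap A's turn, but only the short one is flagged as a back-channel. A real pipeline would need to handle diarization errors in the segment boundaries themselves, which is the difficulty the abstract points to.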
Problem

Research questions and friction points this paper is trying to address.

full-duplex
speech language models
multi-speaker conversation
diarization errors
ASR hallucinations
Innovation

Methods, ideas, or system contributions that make the work stand out.

full-duplex speech language models
multi-turn audio preprocessing
speaker diarization
ASR robustness
open-source pipeline