Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models

📅 2026-03-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the scarcity of high-quality, multi-speaker natural conversation audio data—a key bottleneck in developing full-duplex speech language models. Existing datasets are often limited to single speakers or small scales, and standard preprocessing pipelines are prone to speaker diarization errors and ASR hallucinations. To overcome these challenges, we propose the first open-source, end-to-end scalable preprocessing framework tailored for full-duplex speech language modeling. Our approach integrates robust speaker separation, multi-channel speech alignment, hallucination-resistant ASR post-processing, and dialogue structure modeling. This pipeline substantially mitigates speaker confusion and recognition errors, enabling the generation of high-quality, multi-turn, multi-speaker conversational datasets. The resulting data provides a reliable foundation for training full-duplex models, significantly enhancing interaction naturalness and real-time responsiveness.
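The summary does not describe how the hallucination-resistant post-processing works. As an illustration only, and not the paper's method, one widely used heuristic flags transcripts in which a short phrase repeats back to back many times, a typical ASR failure on silence or noise. The function name and both thresholds below are hypothetical.

```python
def looks_like_repetition_loop(text: str, max_ngram: int = 4, min_repeats: int = 4) -> bool:
    """Return True if any n-gram (n <= max_ngram) occurs min_repeats times in a row.

    Hypothetical heuristic for illustration; thresholds are assumptions.
    """
    words = text.lower().split()
    for n in range(1, max_ngram + 1):
        # Last start index that still leaves room for min_repeats copies of an n-gram.
        last_start = len(words) - n * min_repeats
        for start in range(last_start + 1):
            gram = words[start:start + n]
            if all(
                words[start + k * n:start + (k + 1) * n] == gram
                for k in range(min_repeats)
            ):
                return True
    return False
```

A filter like this would only catch looping output; fluent but wrong transcripts need other checks, such as comparing against a second recognizer.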

📝 Abstract

As the paradigm of AI shifts from text-based LLMs to Speech Language Models (SLMs), there is a growing demand for full-duplex systems capable of real-time, natural human-computer interaction. However, the development of such models is constrained by the scarcity of high-quality, multi-speaker conversational data, as existing large-scale resources are predominantly single-speaker or limited in volume. Handling the complex dynamics of natural dialogue, such as overlapping speech and back-channeling, remains a challenge, and standard processing pipelines suffer from diarization errors and ASR hallucinations. To bridge this gap, we present a robust and scalable open-source data processing pipeline designed for full-duplex models.
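To make overlap and back-channeling concrete, here is a minimal sketch that finds cross-speaker overlaps in diarized segments and marks short interjections that fall entirely inside another speaker's turn. It is not taken from the paper: the `Segment` type, the function names, and the 1.0-second back-channel limit are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def find_overlaps(segments):
    """Return (speaker_a, speaker_b, overlap_start, overlap_end) tuples
    for every pair of segments from different speakers that overlap in time."""
    ordered = sorted(segments, key=lambda s: s.start)
    overlaps = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                break  # sorted by start, so nothing later can overlap a
            if a.speaker != b.speaker:
                overlaps.append((a.speaker, b.speaker, b.start, min(a.end, b.end)))
    return overlaps


def is_backchannel(seg, host, max_duration=1.0):
    """Hypothetical rule: a short segment fully contained in another speaker's turn."""
    return (
        seg.speaker != host.speaker
        and seg.end - seg.start <= max_duration
        and host.start <= seg.start
        and seg.end <= host.end
    )
```

For example, with speaker A talking from 0.0 to 5.0 s and speaker B saying "mm-hm" from 2.0 to 2.5 s, `find_overlaps` reports one overlap and `is_backchannel` treats B's segment as a back-channel rather than a turn change.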

Problem

Research questions and friction points this paper is trying to address.

full-duplex

speech language models

multi-speaker conversation

diarization errors

ASR hallucinations

Innovation

Methods, ideas, or system contributions that make the work stand out.

full-duplex speech language models

multi-turn audio preprocessing

speaker diarization