OmniChat: Enhancing Spoken Dialogue Systems with Scalable Synthetic Data for Diverse Scenarios

📅 2025-01-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current speech dialogue systems suffer degraded performance in realistic, acoustically complex scenarios—such as audio mixing, background music interference, and emotional variability—primarily due to the scarcity of high-quality, multi-scenario conversational data. To address this, we introduce ShareChatX, the first large-scale synthetic speech dialogue dataset covering diverse acoustic conditions, and propose OmniChat, a unified dialogue system. Our approach features: (1) a novel synthetic-data-driven paradigm explicitly designed for complex acoustic environments; (2) a heterogeneous fusion module with dynamic feature selection that jointly models speech, musical context, and emotional states; and (3) an optimized training strategy integrating synthetic and real-world data. Evaluated on the real-world DailyTalk benchmark, OmniChat achieves state-of-the-art performance, demonstrating substantial improvements in audio event recognition, music-aware contextual understanding, and emotion expression modeling.

📝 Abstract
With the rapid development of large language models, researchers have created increasingly advanced spoken dialogue systems that can converse naturally with humans. However, these systems still struggle to handle the full complexity of real-world conversations, including audio events, musical contexts, and emotional expressions, mainly because current dialogue datasets are constrained in both scale and scenario diversity. In this paper, we propose leveraging synthetic data to enhance dialogue models across diverse scenarios. We introduce ShareChatX, the first comprehensive, large-scale dataset for spoken dialogue that spans diverse scenarios. Based on this dataset, we introduce OmniChat, a multi-turn dialogue system with a heterogeneous feature fusion module designed to optimize feature selection across different dialogue contexts. In addition, we explore critical aspects of training dialogue systems with synthetic data. Through comprehensive experimentation, we determine the ideal balance between synthetic and real data, achieving state-of-the-art results on the real-world dialogue dataset DailyTalk. We also highlight the crucial importance of synthetic data in tackling diverse, complex dialogue scenarios, especially those involving audio and music. For more details, please visit our demo page at https://sharechatx.github.io/.
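The abstract describes a heterogeneous feature fusion module with dynamic feature selection over speech, audio-event, music, and emotion features. The paper's actual architecture is not detailed in this summary, but the general idea of context-dependent weighting can be sketched as a learned softmax gate over modality features; all function names and shapes below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse_features(speech, audio_event, music, gate_logits):
    """Hypothetical dynamic fusion: combine three modality feature
    vectors (each shape (d,)) with softmax gate weights (shape (3,)).
    In a real system the gate logits would be predicted from the
    dialogue context by a learned network; here they are given."""
    feats = np.stack([speech, audio_event, music])   # (3, d)
    weights = softmax(np.asarray(gate_logits))       # (3,)
    return (weights[:, None] * feats).sum(axis=0)    # (d,)

# Example: a strongly speech-dominant gate effectively selects
# the speech features, while equal logits average all modalities.
d = 4
speech = np.ones(d) * 1.0
audio_event = np.ones(d) * 2.0
music = np.ones(d) * 3.0
fused_speech = fuse_features(speech, audio_event, music, [100.0, 0.0, 0.0])
fused_mean = fuse_features(speech, audio_event, music, [0.0, 0.0, 0.0])
```

This kind of soft gating lets the model emphasize different input streams per turn (e.g., music features during music-grounded dialogue), which matches the stated goal of optimizing feature selection across dialogue contexts.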
Problem

Research questions and friction points this paper is trying to address.

Speech Dialogue Systems
Complex Real-life Scenarios
Insufficient Training Data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic Data
Speech Dialogue Systems
Complex Scene Handling