Chain-of-Thought Training for Open E2E Spoken Dialogue Systems

📅 2025-05-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing end-to-end spoken dialogue systems rely heavily on large-scale annotated data and often generate responses that lack semantic coherence. To address these limitations, we propose the first framework to integrate Chain-of-Thought (CoT) prompting into open-domain end-to-end spoken dialogue training, unifying automatic speech recognition (ASR), text understanding, and text-to-speech synthesis (TTS) within a single architecture. The approach uses CoT prompts to induce implicit reasoning paths, enabling alignment with multi-task pre-training and efficient few-shot adaptation. Trained on only 300 hours of publicly available conversational data, the model achieves over a 1.5-point ROUGE-1 improvement over strong baselines and performs robustly on standard benchmarks such as Switchboard. The method mitigates the information loss and response incoherence inherent in cascaded architectures and supports end-to-end joint optimization; all models and training code will be publicly released.
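The chaining idea can be illustrated with a small sketch: the dialogue turn is serialized so that the intermediate ASR transcript and text response appear explicitly between the input and output speech tokens, mirroring the model's ASR, text-LM, and TTS pre-training tasks. The special tokens and audio-token IDs below are illustrative assumptions, not the paper's actual format.

```python
# Hypothetical sketch of a chain-of-thought training sequence for an E2E
# spoken dialogue model. Marker tokens (<speech_in>, <asr>, ...) and the
# audio token IDs are assumptions made for this illustration.

def build_cot_sequence(user_speech_tokens, transcript, response_text, response_speech_tokens):
    """Chain the intermediate steps so dialogue fine-tuning stays aligned
    with ASR, text-LM, and TTS pre-training."""
    return (
        ["<speech_in>"] + user_speech_tokens            # user's spoken turn (discrete audio tokens)
        + ["<asr>"] + transcript.split()                # step 1: recognize the speech (ASR task)
        + ["<text_response>"] + response_text.split()   # step 2: generate the reply (text-LM task)
        + ["<tts>"] + response_speech_tokens            # step 3: synthesize the reply (TTS task)
    )

seq = build_cot_sequence(
    ["a12", "a57", "a03"],   # placeholder audio token IDs
    "how are you",
    "i am fine thanks",
    ["a88", "a14"],
)
```

Because each step is generated before the next, the model is trained on the same task formats it saw during multimodal pre-training, which is the alignment the summary describes.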

📝 Abstract
Unlike traditional cascaded pipelines, end-to-end (E2E) spoken dialogue systems preserve full differentiability and capture non-phonemic information, making them well-suited for modeling spoken interactions. However, existing E2E approaches often require large-scale training data and generate responses lacking semantic coherence. We propose a simple yet effective strategy leveraging a chain-of-thought (CoT) formulation, ensuring that training on conversational data remains closely aligned with the multimodal language model (LM)'s pre-training on speech recognition (ASR), text-to-speech synthesis (TTS), and text LM tasks. Our method achieves over a 1.5-point ROUGE-1 improvement over the baseline, successfully training spoken dialogue systems on just 300 hours of publicly available human-human conversation data, such as Switchboard. We will publicly release our models and training code.
Problem

Research questions and friction points this paper is trying to address.

Improving semantic coherence in E2E spoken dialogue responses
Reducing data requirements for training spoken dialogue systems
Aligning multimodal LM pre-training with conversational data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought formulation for E2E dialogue
Multimodal LM alignment with ASR and TTS
Compute-efficient training on 300h data