🤖 AI Summary
Existing end-to-end spoken dialogue systems rely heavily on large-scale annotated data and often generate responses lacking semantic coherence. To address these limitations, we propose the first framework that integrates Chain-of-Thought (CoT) prompting into open-domain end-to-end spoken dialogue training, unifying automatic speech recognition (ASR), text understanding, and text-to-speech (TTS) within a single architecture. Our approach leverages CoT prompts to induce implicit reasoning paths, enabling multi-task pretraining alignment and efficient few-shot adaptation. Trained on only 300 hours of publicly available conversational data, our model achieves over a 1.5-point improvement in ROUGE-1 score over strong baselines and demonstrates robust performance on standard benchmarks including Switchboard. The method effectively mitigates the information loss and response incoherence inherent in cascaded architectures and supports end-to-end joint optimization. All models and code are publicly released.
📝 Abstract
Unlike traditional cascaded pipelines, end-to-end (E2E) spoken dialogue systems preserve full differentiability and capture non-phonemic information, making them well-suited for modeling spoken interactions. However, existing E2E approaches often require large-scale training data and generate responses lacking semantic coherence. We propose a simple yet effective strategy leveraging a chain-of-thought (CoT) formulation, ensuring that training on conversational data remains closely aligned with the multimodal language model (LM)'s pre-training on speech recognition (ASR), text-to-speech synthesis (TTS), and text LM tasks. Our method achieves over a 1.5-point ROUGE-1 improvement over the baseline, and is compute-efficient enough to successfully train spoken dialogue systems on just 300 hours of publicly available human-human conversation data, such as Switchboard. We will publicly release our models and training code.
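The CoT formulation can be pictured as decomposing each spoken dialogue turn into the same subtasks the multimodal LM saw during pre-training: input speech to transcript (ASR), transcript to reply text (text LM), and reply text to output speech (TTS). A minimal sketch of such a training-sequence layout is below; the tag names and token format are illustrative assumptions, not the paper's actual scheme:

```python
def build_cot_sequence(user_speech_tokens: str,
                       transcript: str,
                       response_text: str,
                       response_speech_tokens: str) -> str:
    """Assemble one CoT-style training sequence for a dialogue turn.

    The intermediate text steps ([ASR] and [RESPONSE]) are emitted between
    the input and output speech tokens, so a single autoregressive target
    covers ASR, text LM, and TTS in order. Tag names here are hypothetical.
    """
    return (
        f"[SPEECH_IN] {user_speech_tokens} "
        f"[ASR] {transcript} "
        f"[RESPONSE] {response_text} "
        f"[TTS] {response_speech_tokens}"
    )


# Example turn: discrete speech tokens are placeholders.
seq = build_cot_sequence("<s1> <s2> <s3>", "hello there",
                         "hi, how are you?", "<t1> <t2>")
```

Because the intermediate text is part of the target sequence, the model is supervised on the reasoning path rather than mapping speech to speech directly, which is one plausible reading of how the method mitigates incoherent responses.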