DeepDialogue: A Multi-Turn Emotionally-Rich Spoken Dialogue Dataset

📅 2025-05-26
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Existing multimodal dialogue datasets suffer from narrow emotional coverage, limited domain diversity, shallow turn depth, and single-modality constraints, hindering the development of multimodal affective dialogue systems. To address these limitations, we introduce DeepDialogue, the first large-scale open-source multimodal dialogue dataset: 40,150 multi-turn dialogues spanning 41 domains and 20 fine-grained emotion categories, enabling multi-turn affective coherence modeling and utterance-level emotion alignment. Our methodology pairs nine language models (4B–72B parameters) for collaborative generation, then applies human annotation, LLM-based quality filtering, and emotion-controllable text-to-speech synthesis. Key findings: (i) cross-model generation yields more coherent dialogues than same-model generation; (ii) concrete domains produce more consistent conversations than abstract ones; and (iii) smaller models lose coherence beyond six turns.
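The paper's generation pipeline is not reproduced here, but the cross-model idea is straightforward to sketch: two different LLMs alternate turns while each reply is steered toward the next emotion in a planned progression. Below is a minimal Python sketch assuming an OpenAI-compatible chat endpoint; the model names, prompts, and emotion plan are illustrative placeholders, not DeepDialogue's actual configuration.

```python
# Sketch of cross-model dialogue generation: two LLMs alternate turns,
# each turn conditioned on a planned emotion. Illustrative only.
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint works here

def generate_dialogue(model_a: str, model_b: str, domain: str,
                      emotion_plan: list[str]) -> list[dict]:
    """Two models alternate speakers; each turn targets one planned emotion."""
    transcript: list[dict] = []
    for turn, emotion in enumerate(emotion_plan):
        model = model_a if turn % 2 == 0 else model_b
        # Show the current speaker the dialogue so far as plain text.
        context = "\n".join(f"{t['speaker']}: {t['text']}" for t in transcript)
        messages = [
            {"role": "system",
             "content": f"Continue a casual two-person conversation about {domain}. "
                        f"Reply with a single utterance that expresses '{emotion}'."},
            {"role": "user", "content": context or "Start the conversation."},
        ]
        reply = client.chat.completions.create(
            model=model, messages=messages, max_tokens=80,
        ).choices[0].message.content
        transcript.append({"speaker": model, "emotion": emotion, "text": reply})
    return transcript

# e.g. a 6-turn dialogue -- the coherence threshold reported for smaller models:
# generate_dialogue("model-a-7b", "model-b-70b", "travel",
#                   ["curiosity", "joy", "surprise",
#                    "excitement", "nostalgia", "contentment"])
```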

📝 Abstract
Recent advances in conversational AI have demonstrated impressive capabilities in single-turn responses, yet multi-turn dialogues remain challenging for even the most sophisticated language models. Current dialogue datasets are limited in their emotional range, domain diversity, turn depth, and are predominantly text-only, hindering progress in developing more human-like conversational systems across modalities. To address these limitations, we present DeepDialogue, a large-scale multimodal dataset containing 40,150 high-quality multi-turn dialogues spanning 41 domains and incorporating 20 distinct emotions with coherent emotional progressions. Our approach pairs 9 different language models (4B–72B parameters) to generate 65,600 initial conversations, which we then evaluate through a combination of human annotation and LLM-based quality filtering. The resulting dataset reveals fundamental insights: smaller models fail to maintain coherence beyond 6 dialogue turns; concrete domains (e.g., "cars," "travel") yield more meaningful conversations than abstract ones (e.g., "philosophy"); and cross-model interactions produce more coherent dialogues than same-model conversations. A key contribution of DeepDialogue is its speech component, where we synthesize emotion-consistent voices for all 40,150 dialogues, creating the first large-scale open-source multimodal dialogue dataset that faithfully preserves emotional context across multi-turn conversations.
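The abstract's two-stage evaluation (human annotation plus LLM-based quality filtering) reduces 65,600 candidate conversations to the 40,150 kept in the dataset. A hedged sketch of what the LLM-as-judge half of that stage could look like follows; the rubric wording, judge model, and score threshold are assumptions, not values from the paper.

```python
# Sketch of an LLM-as-judge filtering pass: score each candidate dialogue
# for coherence and emotion consistency, keep those above a cutoff.
# Rubric, judge model, and threshold are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = ('Rate the dialogue from 1-5 on (a) turn-to-turn coherence and '
          '(b) consistency between each utterance and its emotion label. '
          'Answer as JSON: {"coherence": int, "emotion_consistency": int}')

def judge(dialogue: list[dict], judge_model: str = "gpt-4o-mini") -> dict:
    """Return the judge model's scores for one candidate dialogue."""
    text = "\n".join(f"[{t['emotion']}] {t['speaker']}: {t['text']}"
                     for t in dialogue)
    out = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "system", "content": RUBRIC},
                  {"role": "user", "content": text}],
        response_format={"type": "json_object"},  # force parseable scores
    ).choices[0].message.content
    return json.loads(out)

def keep(dialogue: list[dict], min_score: int = 4) -> bool:
    """Keep a dialogue only if both scores clear the (assumed) threshold."""
    scores = judge(dialogue)
    return min(scores["coherence"], scores["emotion_consistency"]) >= min_score
```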
Problem

Research questions and friction points this paper is trying to address.

Limited emotional range in current dialogue datasets
Lack of domain diversity and turn depth in dialogues
Absence of large-scale multimodal datasets with emotional context
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal dataset with 40,150 emotionally-rich dialogues
Combines human annotation and LLM-based quality filtering
Synthesizes emotion-consistent voices for all 40,150 dialogues (orchestration sketched below)
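The speech component keeps speaker identity stable across turns while varying emotion per utterance. Below is a minimal orchestration sketch of that idea; `synthesize` is a hypothetical hook standing in for whatever emotion-controllable TTS backend is plugged in, since the paper's actual synthesis stack is not specified here.

```python
# Sketch of the speech stage: one fixed voice per speaker, one emotion
# label per utterance. `synthesize` is a hypothetical backend hook.
from pathlib import Path
from typing import Callable

def render_dialogue(dialogue: list[dict], out_dir: Path,
                    synthesize: Callable[[str, str, str], bytes]) -> None:
    """synthesize(text, emotion, voice) -> WAV bytes (backend-specific)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assign each speaker a fixed voice so identity is consistent across turns.
    voices: dict[str, str] = {}
    for i, turn in enumerate(dialogue):
        voice = voices.setdefault(turn["speaker"], f"voice_{len(voices)}")
        wav = synthesize(turn["text"], turn["emotion"], voice)
        (out_dir / f"turn_{i:02d}_{turn['emotion']}.wav").write_bytes(wav)
```

Decoupling the orchestration from the backend this way lets any emotion-controllable TTS system fill the `synthesize` slot without changing the per-dialogue bookkeeping.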