🤖 AI Summary
Existing multimodal dialogue datasets suffer from narrow emotional coverage, limited domain diversity, shallow turn depth, and single-modality constraints, hindering the development of multimodal affective dialogue systems. To address these limitations, we introduce the first large-scale open-source multimodal dialogue dataset, comprising 40,150 multi-turn dialogues that span 41 domains and 20 fine-grained emotion categories, enabling multi-turn affective coherence modeling and utterance-level emotion alignment. Our methodology integrates collaborative generation across nine language models (4B–72B parameters), human annotation, LLM-based quality filtering, and emotion-controllable text-to-speech synthesis. Key findings: (i) cross-model generation enhances dialogue coherence; (ii) concrete domains yield more consistent conversations than abstract ones; and (iii) smaller models exhibit a coherence decay threshold of roughly six turns. This dataset advances research in emotion-aware, multi-turn-consistent, and cross-modal dialogue systems.
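To make the generation setup concrete, here is a minimal sketch of cross-model dialogue generation: two different models alternate speaker roles, and each turn is conditioned on a target emotion so utterance-level labels come for free. The `chat()` helper, model names, and emotion list are hypothetical placeholders under these assumptions, not the authors' actual pipeline.

```python
import random

# Illustrative subset; the dataset defines 20 fine-grained emotion categories.
EMOTIONS = ["joy", "curiosity", "frustration", "surprise"]

def chat(model: str, messages: list[dict]) -> str:
    """Stand-in for a real inference call to a served open-weights model.
    Returns canned text so the sketch runs end to end."""
    return f"[{model} reply to: {messages[-1]['content'][:40]}...]"

def generate_dialogue(model_a: str, model_b: str, domain: str, turns: int = 6) -> list[dict]:
    """Two different models alternate turns; each utterance is generated
    under an explicit target emotion, yielding turn-level emotion labels."""
    history: list[dict] = []
    dialogue: list[dict] = []
    for t in range(turns):
        speaker = model_a if t % 2 == 0 else model_b  # cross-model alternation
        emotion = random.choice(EMOTIONS)  # toy choice; the dataset enforces coherent progressions
        system = (f"You are having a conversation about {domain}. "
                  f"Reply with one utterance expressing {emotion}.")
        messages = [{"role": "system", "content": system}] + history
        if not history:
            messages.append({"role": "user", "content": f"Start a conversation about {domain}."})
        utterance = chat(speaker, messages)
        dialogue.append({"turn": t, "model": speaker, "emotion": emotion, "text": utterance})
        history.append({"role": "assistant" if t % 2 == 0 else "user", "content": utterance})
    return dialogue

if __name__ == "__main__":
    for turn in generate_dialogue("small-9b-model", "large-72b-model", "travel"):
        print(f"{turn['model']} ({turn['emotion']}): {turn['text']}")
```

Pairing two distinct models for the two speaker roles is the setup the findings credit with higher coherence than same-model self-talk.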
📝 Abstract
Recent advances in conversational AI have demonstrated impressive capabilities in single-turn responses, yet multi-turn dialogues remain challenging for even the most sophisticated language models. Current dialogue datasets are limited in emotional range, domain diversity, and turn depth, and are predominantly text-only, hindering progress toward more human-like conversational systems across modalities. To address these limitations, we present DeepDialogue, a large-scale multimodal dataset containing 40,150 high-quality multi-turn dialogues spanning 41 domains and incorporating 20 distinct emotions with coherent emotional progressions. Our approach pairs 9 different language models (4B–72B parameters) to generate 65,600 initial conversations, which we then evaluate through a combination of human annotation and LLM-based quality filtering. The resulting dataset reveals fundamental insights: smaller models fail to maintain coherence beyond 6 dialogue turns; concrete domains (e.g., "cars," "travel") yield more meaningful conversations than abstract ones (e.g., "philosophy"); and cross-model interactions produce more coherent dialogues than same-model conversations. A key contribution of DeepDialogue is its speech component: we synthesize emotion-consistent voices for all 40,150 dialogues, creating the first large-scale open-source multimodal dialogue dataset that faithfully preserves emotional context across multi-turn conversations.
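As a companion sketch, the LLM-judge half of the quality-filtering stage (the step that, together with human annotation, reduced 65,600 candidates to the released 40,150 dialogues) could look like the snippet below. The `judge()` scorer and the 0.7 threshold are illustrative assumptions, not the paper's reported setup.

```python
def judge(dialogue: list[dict]) -> float:
    """Stand-in for an LLM judge that returns a 0-1 coherence score;
    a toy lexical-diversity heuristic keeps the sketch runnable."""
    transcript = " ".join(turn["text"] for turn in dialogue)
    return min(1.0, len(set(transcript.lower().split())) / 50)

def filter_dialogues(candidates: list[list[dict]], threshold: float = 0.7) -> list[list[dict]]:
    """Keep only dialogues the judge scores at or above the threshold."""
    return [d for d in candidates if judge(d) >= threshold]

if __name__ == "__main__":
    toy = [{"text": "I love planning trips."}, {"text": "Where to next, mountains or the coast?"}]
    print("coherence score:", judge(toy))
```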