EmotionTalk: An Interactive Chinese Multimodal Emotion Dataset With Rich Annotations

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Chinese sentiment datasets suffer from poor linguistic and cultural adaptability, unimodal representation, and coarse-grained annotations, hindering high-quality multimodal sentiment analysis. To address these limitations, the authors introduce EmotionTalk, the first high-quality, dialogue-oriented, multimodal emotion dataset tailored to the Chinese context. EmotionTalk comprises natural dyadic conversations involving 19 actors, with 23.6 hours of synchronized audiovisual recordings and corresponding transcripts. Annotations are fine-grained, covering seven emotion categories, five-level sentiment polarity, and four-dimensional speech captions. The paper proposes a standardized pipeline for multimodal acquisition, cross-modal alignment, and data cleaning, ensuring speaker- and modality-consistent labeling. The dataset is publicly released with 19,250 samples. Empirical evaluation demonstrates strong performance on both unimodal and multimodal emotion recognition tasks, establishing a benchmark for cross-cultural affective modeling, missing-modality research, and speech captioning.

📝 Abstract
In recent years, emotion recognition has come to play a critical role in applications such as human-computer interaction, mental health monitoring, and sentiment analysis. While datasets for emotion analysis in languages such as English have proliferated, there remains a pressing need for high-quality, comprehensive datasets tailored to the unique linguistic, cultural, and multimodal characteristics of Chinese. In this work, we propose EmotionTalk, an interactive Chinese multimodal emotion dataset with rich annotations. This dataset provides multimodal information from 19 actors participating in dyadic conversational settings, incorporating acoustic, visual, and textual modalities. It includes 23.6 hours of speech (19,250 utterances), annotations for 7 utterance-level emotion categories (happy, surprise, sad, disgust, anger, fear, and neutral), 5-dimensional sentiment labels (negative, weakly negative, neutral, weakly positive, and positive), and 4-dimensional speech captions (speaker, speaking style, emotion, and overall). The dataset is well-suited for research on unimodal and multimodal emotion recognition, missing-modality challenges, and speech captioning tasks. To our knowledge, it represents the first high-quality and versatile Chinese dialogue multimodal emotion dataset, which is a valuable contribution to research on cross-cultural emotion analysis and recognition. Additionally, we conduct experiments on EmotionTalk to demonstrate the effectiveness and quality of the dataset. It will be open-source and freely available for all academic purposes. The dataset and codes will be made available at: https://github.com/NKU-HLT/EmotionTalk.
Problem

Research questions and friction points this paper is trying to address.

Lack of high-quality Chinese multimodal emotion datasets
Need for culturally tailored emotion recognition resources
Addressing missing modality challenges in emotion analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Chinese emotion dataset with rich annotations
Includes acoustic, visual, and textual modalities
Supports unimodal and multimodal emotion recognition
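The abstract fully specifies the dataset's label space (7 emotion categories, 5 sentiment polarity levels, and 4 caption dimensions), which can be captured in a small record sketch. Note that the record layout, field names, and file paths below are hypothetical assumptions for illustration; the actual release format on the project's GitHub may differ. Only the label sets themselves come from the paper.

```python
from dataclasses import dataclass

# Label sets taken verbatim from the abstract.
EMOTIONS = {"happy", "surprise", "sad", "disgust", "anger", "fear", "neutral"}
SENTIMENTS = {"negative", "weakly negative", "neutral",
              "weakly positive", "positive"}
CAPTION_DIMS = ("speaker", "speaking style", "emotion", "overall")

@dataclass
class Utterance:
    """Hypothetical per-utterance record; layout is an assumption."""
    audio_path: str   # acoustic modality
    video_path: str   # visual modality
    transcript: str   # textual modality (Chinese)
    emotion: str      # one of the 7 utterance-level categories
    sentiment: str    # one of the 5 polarity labels
    captions: dict    # 4-dimensional speech caption

    def validate(self) -> bool:
        # Check that labels fall inside the schema described in the abstract.
        return (self.emotion in EMOTIONS
                and self.sentiment in SENTIMENTS
                and all(dim in self.captions for dim in CAPTION_DIMS))

sample = Utterance(
    audio_path="wav/dialog_001_utt_03.wav",   # hypothetical path
    video_path="mp4/dialog_001_utt_03.mp4",   # hypothetical path
    transcript="...",
    emotion="happy",
    sentiment="weakly positive",
    captions={dim: "..." for dim in CAPTION_DIMS},
)
print(sample.validate())  # True
```

Validation of this kind is useful when pairing the three modalities for missing-modality experiments, since a record with an out-of-schema label can be filtered before training.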
👥 Authors
Haoqin Sun — Nankai University (Affective computing, Speech signal processing, Audio understanding)
Xuechen Wang — Nankai University
Jinghua Zhao — Nankai University
Shiwan Zhao — Independent Researcher; formerly Research Scientist at IBM Research - China (2000–2020) (AGI, Large Language Models, NLP, Speech, Recommender Systems)
Jiaming Zhou — Nankai University
Hui Wang — Nankai University
Jiabei He — Nankai University
Aobo Kong — Nankai University (NLP, LLM)
Xi Yang — Beijing Academy of Artificial Intelligence
Yequan Wang — Beijing Academy of Artificial Intelligence
Yonghua Lin — Beijing Academy of Artificial Intelligence
Yong Qin — Nankai University