🤖 AI Summary
This work addresses the limitations of existing speech emotion modeling approaches, which typically rely on predefined categorical labels or low-dimensional continuous annotations and thus struggle to capture fine-grained emotional expression or to align with natural language. To overcome these limitations, the authors introduce AffectSpeech, a dataset of human-recorded speech with structured annotations along six dimensions: sentiment polarity, open-vocabulary emotion captions, intensity level, prosodic attributes, prominent segments, and semantic content. High-quality, diverse fine-grained labels are obtained through a human-in-the-loop pipeline that combines algorithmic pre-annotation, description generation by multiple large language models, and rigorous human verification. Models trained on AffectSpeech for speech emotion captioning and synthesis consistently outperform current state-of-the-art methods across multiple evaluation metrics.
📝 Abstract
Emotion is essential in spoken communication, yet most existing frameworks for speech emotion modeling rely on predefined categories or low-dimensional continuous attributes, which offer limited expressive capacity. Recent advances in speech emotion captioning and synthesis have shown that textual descriptions provide a more flexible and interpretable alternative for representing affective characteristics in speech. However, progress in this direction is hindered by the lack of an emotional speech dataset paired with reliable, fine-grained natural language annotations. To address this gap, we introduce AffectSpeech, a large-scale corpus of human-recorded speech enriched with structured descriptions for fine-grained emotion analysis and generation. Each utterance is characterized along six complementary dimensions: sentiment polarity, open-vocabulary emotion captions, intensity level, prosodic attributes, prominent segments, and semantic content, enabling multi-granular modeling of vocal expression. To balance annotation quality and scalability, we adopt a human-LLM collaborative annotation pipeline that integrates algorithmic pre-labeling, multi-LLM description generation, and human-in-the-loop verification. Furthermore, these annotations are reformulated into diverse descriptive styles to enhance linguistic diversity and reduce stylistic bias in downstream modeling. Experimental results on speech emotion captioning and synthesis demonstrate that models trained on AffectSpeech consistently achieve superior performance across multiple evaluation settings.
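To make the six-dimension schema concrete, here is a minimal sketch of what a single AffectSpeech record could look like. All field names and values are hypothetical illustrations of the dimensions named in the abstract, not the dataset's released format.

```python
# Hypothetical annotation record illustrating the six dimensions named in
# the abstract; field names and values are invented for illustration only.
example_record = {
    "audio": "utt_0001.wav",
    "sentiment_polarity": "negative",                      # 1. sentiment polarity
    "emotion_caption": "quiet resignation tinged with fatigue",  # 2. open-vocabulary caption
    "intensity": "moderate",                               # 3. intensity level
    "prosody": {"pitch": "low", "rate": "slow", "energy": "soft"},  # 4. prosodic attributes
    "prominent_segment": [2.4, 3.1],                       # 5. prominent span (seconds)
    "transcript": "I guess it doesn't matter anymore.",    # 6. semantic content
}
```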
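The human-LLM collaborative pipeline can likewise be pictured as three chained stages. The sketch below stubs each stage with placeholder logic; every function, class, and default value is hypothetical, as the paper's actual tooling and models are not described here.

```python
# Illustrative sketch of the three-stage human-LLM annotation pipeline.
# All names and stub logic are hypothetical; the paper's actual tooling,
# models, and record layout are not specified here.
from dataclasses import dataclass, field


@dataclass
class Draft:
    """Draft annotation record for one utterance (hypothetical layout)."""
    audio_path: str
    polarity: str = "neutral"     # stage-1 algorithmic pre-label
    intensity: str = "medium"     # stage-1 algorithmic pre-label
    captions: list = field(default_factory=list)  # stage-2 candidate captions


def pre_annotate(audio_path: str) -> Draft:
    # Stage 1: algorithmic pre-labeling. In practice, pretrained sentiment
    # and prosody models would fill these fields; stubbed with defaults here.
    return Draft(audio_path=audio_path)


def generate_captions(draft: Draft, n_models: int = 3) -> Draft:
    # Stage 2: several LLMs each propose an open-vocabulary emotion
    # description; multiple candidates reduce single-model stylistic bias.
    draft.captions = [f"candidate caption from LLM #{i}" for i in range(n_models)]
    return draft


def human_verify(draft: Draft) -> Draft:
    # Stage 3: a human validator accepts, edits, or rejects the draft labels
    # and candidate captions; stubbed as a pass-through here.
    return draft


if __name__ == "__main__":
    record = human_verify(generate_captions(pre_annotate("utt_0001.wav")))
    print(record)
```

Keeping pre-labeling, caption generation, and verification as separate stages mirrors the quality/scalability trade-off the abstract describes: cheap automation produces draft labels at scale, and human effort is spent only on adjudication.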