🤖 AI Summary
This work addresses the limitations of existing speech emotion modeling approaches, which typically rely on predefined categorical labels or low-dimensional continuous annotations and thus struggle to capture fine-grained emotional expression or to align with natural language. To overcome these limitations, the authors introduce AffectSpeech, a dataset of human-recorded speech with structured annotations along six dimensions: sentiment polarity, open-vocabulary emotion captions, intensity level, prosodic attributes, prominent segments, and semantic content. High-quality, diverse fine-grained labels are obtained through a human-in-the-loop pipeline that combines algorithmic pre-annotation, description generation by multiple large language models, and rigorous human verification. Models trained on AffectSpeech for speech emotion captioning and synthesis consistently outperform current state-of-the-art methods across multiple evaluation metrics.
📝 Abstract
Emotion is essential in spoken communication, yet most existing frameworks for speech emotion modeling rely on predefined categories or low-dimensional continuous attributes, which offer limited expressive capacity. Recent advances in speech emotion captioning and synthesis have shown that textual descriptions provide a more flexible and interpretable alternative for representing affective characteristics in speech. However, progress in this direction is hindered by the lack of an emotional speech dataset paired with reliable, fine-grained natural language annotations. To address this gap, we introduce AffectSpeech, a large-scale corpus of human-recorded speech enriched with structured descriptions for fine-grained emotion analysis and generation. Each utterance is characterized along six complementary dimensions: sentiment polarity, open-vocabulary emotion captions, intensity level, prosodic attributes, prominent segments, and semantic content, enabling multi-granular modeling of vocal expression. To balance annotation quality and scalability, we adopt a human-LLM collaborative annotation pipeline that integrates algorithmic pre-labeling, multi-LLM description generation, and human-in-the-loop verification. Furthermore, these annotations are reformulated into diverse descriptive styles to enhance linguistic diversity and reduce stylistic bias in downstream modeling. Experimental results on speech emotion captioning and synthesis demonstrate that models trained on AffectSpeech consistently achieve superior performance across multiple evaluation settings.
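To make the six-dimension schema concrete, here is a minimal sketch of what a single AffectSpeech record could look like. All field names and values are hypothetical illustrations of the dimensions named in the abstract, not the dataset's released format.

```python
# Hypothetical annotation record illustrating the six dimensions named in
# the abstract; field names and values are invented for illustration only.
example_record = {
    "audio": "utt_0001.wav",
    "sentiment_polarity": "negative",                      # 1. sentiment polarity
    "emotion_caption": "quiet resignation tinged with fatigue",  # 2. open-vocabulary caption
    "intensity": "moderate",                               # 3. intensity level
    "prosody": {"pitch": "low", "rate": "slow", "energy": "soft"},  # 4. prosodic attributes
    "prominent_segment": [2.4, 3.1],                       # 5. prominent span (seconds)
    "transcript": "I guess it doesn't matter anymore.",    # 6. semantic content
}
```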
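The human-LLM collaborative pipeline can likewise be pictured as three chained stages. The sketch below stubs each stage with placeholder logic; every function, class, and default value is hypothetical, as the paper's actual tooling and models are not described here.

```python
# Illustrative sketch of the three-stage human-LLM annotation pipeline.
# All names and stub logic are hypothetical; the paper's actual tooling,
# models, and record layout are not specified here.
from dataclasses import dataclass, field


@dataclass
class Draft:
    """Draft annotation record for one utterance (hypothetical layout)."""
    audio_path: str
    polarity: str = "neutral"     # stage-1 algorithmic pre-label
    intensity: str = "medium"     # stage-1 algorithmic pre-label
    captions: list = field(default_factory=list)  # stage-2 candidate captions


def pre_annotate(audio_path: str) -> Draft:
    # Stage 1: algorithmic pre-labeling. In practice, pretrained sentiment
    # and prosody models would fill these fields; stubbed with defaults here.
    return Draft(audio_path=audio_path)


def generate_captions(draft: Draft, n_models: int = 3) -> Draft:
    # Stage 2: several LLMs each propose an open-vocabulary emotion
    # description; multiple candidates reduce single-model stylistic bias.
    draft.captions = [f"candidate caption from LLM #{i}" for i in range(n_models)]
    return draft


def human_verify(draft: Draft) -> Draft:
    # Stage 3: a human validator accepts, edits, or rejects the draft labels
    # and candidate captions; stubbed as a pass-through here.
    return draft


if __name__ == "__main__":
    record = human_verify(generate_captions(pre_annotate("utt_0001.wav")))
    print(record)
```

Keeping pre-labeling, caption generation, and verification as separate stages mirrors the quality/scalability trade-off the abstract describes: cheap automation produces draft labels at scale, and human effort is spent only on adjudication.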