The MSP-Podcast Corpus

📅 2025-09-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing speech emotion recognition (SER) datasets are often limited in size, emotional balance, and speaker diversity, which hinders real-world applicability. This work describes the MSP-Podcast corpus, a large emotional speech database of over 400 hours of recordings collected from audio-sharing websites under licenses that permit distribution of the corpus. A machine learning-driven selection pipeline picks emotionally diverse recordings to balance the representation of emotions across speakers and environments. At least five raters annotate the emotional labels, which include primary and secondary emotion categories and the emotional attributes valence, arousal, and dominance. The corpus also provides speaker identification for most samples and human transcriptions of the lexical content for the entire corpus. The authors present it as a comprehensive resource better suited to advancing SER systems in practical, real-world scenarios.

📝 Abstract

The availability of large, high-quality emotional speech databases is essential for advancing speech emotion recognition (SER) in real-world scenarios. However, many existing databases face limitations in size, emotional balance, and speaker diversity. This study describes the MSP-Podcast corpus, summarizing our ten-year effort. The corpus consists of over 400 hours of diverse audio samples from various audio-sharing websites, all of which have Common Licenses that permit the distribution of the corpus. We annotate the corpus with rich emotional labels, including primary (single dominant emotion) and secondary (multiple emotions perceived in the audio) emotional categories, as well as emotional attributes for valence, arousal, and dominance. At least five raters annotate these emotional labels. The corpus also has speaker identification for most samples, and human transcriptions of the lexical content of the sentences for the entire corpus. The data collection protocol includes a machine learning-driven pipeline for selecting emotionally diverse recordings, ensuring a balanced and varied representation of emotions across speakers and environments. The resulting database provides a comprehensive, high-quality resource, better suited for advancing SER systems in practical, real-world scenarios.
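To make the annotation scheme concrete, here is a minimal sketch of how ratings from several annotators might be combined for one speech segment. The abstract does not state how the corpus aggregates ratings, so everything here is an assumption made for illustration: the function name `aggregate_annotations`, the dictionary layout, the plurality vote for the primary emotion, and the averaging of attribute scores.

```python
from collections import Counter
from statistics import mean

ATTRIBUTES = ("valence", "arousal", "dominance")


def aggregate_annotations(ratings):
    """Combine per-rater annotations for one segment (hypothetical scheme).

    Each rating is a dict with a "primary" label, a list of "secondary"
    labels, and numeric scores for valence, arousal, and dominance.
    """
    # The abstract says at least five raters annotate the labels.
    if len(ratings) < 5:
        raise ValueError("expected at least five raters per segment")

    # Assumed rule: plurality vote; a tie for first place yields no consensus.
    votes = Counter(r["primary"] for r in ratings).most_common()
    top_count = votes[0][1]
    winners = [label for label, count in votes if count == top_count]
    primary = winners[0] if len(winners) == 1 else None

    # Count how many raters selected each secondary emotion.
    secondary = Counter(label for r in ratings for label in set(r["secondary"]))

    # Assumed rule: attribute scores are averaged across raters.
    attributes = {name: mean(r[name] for r in ratings) for name in ATTRIBUTES}

    return {"primary": primary, "secondary": dict(secondary), "attributes": attributes}


if __name__ == "__main__":
    # Made-up ratings on a made-up numeric scale, for illustration only.
    example = [
        {"primary": "happy", "secondary": ["happy", "excited"], "valence": 5, "arousal": 5, "dominance": 4},
        {"primary": "happy", "secondary": ["happy"], "valence": 6, "arousal": 5, "dominance": 4},
        {"primary": "neutral", "secondary": ["neutral"], "valence": 5, "arousal": 3, "dominance": 4},
        {"primary": "happy", "secondary": ["happy", "excited"], "valence": 4, "arousal": 6, "dominance": 4},
        {"primary": "neutral", "secondary": ["neutral", "happy"], "valence": 5, "arousal": 4, "dominance": 4},
    ]
    print(aggregate_annotations(example))
```

Returning `None` on a tie keeps segments without a clear dominant emotion distinguishable from those with one; a real pipeline could apply a different rule.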

Problem

Research questions and friction points this paper is trying to address.

Addresses limitations in existing emotional speech databases

Creates a large, diverse corpus for speech emotion recognition

Provides richly annotated emotional labels and speaker identification

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale emotional speech corpus

Multi-level emotional annotation system

Machine learning-driven data selection
