NaturalVoices: A Large-Scale, Spontaneous and Emotional Podcast Dataset for Voice Conversion

📅 2025-10-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Voice conversion (VC) research is hindered by the scarcity of large-scale, spontaneous, and emotionally rich real-world speech data, which limits models' ability to capture natural prosody and expressive emotion. To address this, we introduce NaturalVoices (NV), the first large-scale spontaneous podcast dataset designed for emotion-aware VC. It comprises 5,049 hours of spontaneous podcast speech with automatic annotations for emotion (categorical and attribute-based), speaker identity, speech quality, transcripts, and sound events. We also release an open-source pipeline with modular annotation tools and flexible filtering for building customized subsets. Experiments indicate that NaturalVoices supports robust, generalizable VC models that produce natural, expressive speech, while exposing limitations of current VC architectures on large-scale spontaneous data. These results position the dataset as both a valuable resource and a challenging benchmark for emotion-aware voice conversion.

📝 Abstract
Everyday speech conveys far more than words: it reflects who we are, how we feel, and the circumstances surrounding our interactions. Yet most existing speech datasets are acted, limited in scale, and fail to capture the expressive richness of real-life communication. With the rise of large neural networks, several large-scale speech corpora have emerged and been widely adopted across various speech processing tasks. However, the field of voice conversion (VC) still lacks large-scale, expressive, and real-life speech resources suitable for modeling natural prosody and emotion. To fill this gap, we release NaturalVoices (NV), the first large-scale spontaneous podcast dataset specifically designed for emotion-aware voice conversion. It comprises 5,049 hours of spontaneous podcast recordings with automatic annotations for emotion (categorical and attribute-based), speech quality, transcripts, speaker identity, and sound events. The dataset captures expressive emotional variation across thousands of speakers, diverse topics, and natural speaking styles. We also provide an open-source pipeline with modular annotation tools and flexible filtering, enabling researchers to construct customized subsets for a wide range of VC tasks. Experiments demonstrate that NaturalVoices supports the development of robust and generalizable VC models capable of producing natural, expressive speech, while revealing limitations of current architectures when applied to large-scale spontaneous data. These results suggest that NaturalVoices is both a valuable resource and a challenging benchmark for advancing the field of voice conversion. The dataset is available at: https://huggingface.co/JHU-SmileLab
Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale spontaneous speech datasets for voice conversion
Need for emotion-aware voice conversion with natural prosody
Absence of expressive real-life speech resources for VC modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale spontaneous podcast dataset for voice conversion
Automatic emotion and speech quality annotation pipeline
Modular tools for customizable voice conversion subsets
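
The idea of building customized subsets from per-utterance annotations can be illustrated with a minimal sketch. This is not the released pipeline: the `Utterance` record, its field names, the score scales, and the `select_subset` helper are all hypothetical and only show how emotion, quality, and duration annotations might be combined into a filter.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Utterance:
    # Hypothetical schema; the actual NaturalVoices metadata may differ.
    utt_id: str
    speaker_id: str
    emotion: str       # categorical emotion label
    arousal: float     # attribute-based emotion score (scale assumed)
    quality: float     # speech quality score, higher = cleaner (assumed)
    duration_s: float  # utterance length in seconds


def select_subset(
    utterances: Iterable[Utterance],
    emotions: Optional[Set[str]] = None,
    min_quality: float = 0.0,
    min_duration_s: float = 0.0,
    max_duration_s: float = float("inf"),
) -> List[Utterance]:
    """Keep utterances that pass every active filter."""
    return [
        u
        for u in utterances
        if (emotions is None or u.emotion in emotions)
        and u.quality >= min_quality
        and min_duration_s <= u.duration_s <= max_duration_s
    ]


# Toy records standing in for real annotations.
corpus = [
    Utterance("a1", "spk1", "happy", 5.2, 3.8, 4.0),
    Utterance("a2", "spk1", "neutral", 3.1, 4.1, 6.5),
    Utterance("b1", "spk2", "angry", 6.0, 2.0, 3.0),
    Utterance("b2", "spk2", "happy", 5.5, 3.9, 25.0),
]

# Emotional, reasonably clean, short utterances only.
subset = select_subset(
    corpus,
    emotions={"happy", "angry"},
    min_quality=3.5,
    max_duration_s=10.0,
)
print([u.utt_id for u in subset])  # prints ['a1']
```

Keeping each criterion as an optional argument mirrors the "flexible filtering" described in the abstract: a filter left at its default simply does not constrain the subset.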
Authors

Zongyang Du
Chongqing University of Posts and Telecommunications
Shreeram Suresh Chandra
Department of Electrical and Computer Engineering, University of Texas at Dallas, Richardson, TX 75080 USA, and also with the Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD 21218 USA
Ismail Rasim Ulgen
Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD 21218 USA
Aurosweta Mahapatra
Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD 21218 USA
Ali N. Salman
ARRAY Innovation, Bahrain
Carlos Busso
Language Technologies Institute, Carnegie Mellon University, Pittsburgh PA-15213 USA
Berrak Sisman
Assistant Professor (ECE & DSAI), Johns Hopkins University
Machine Learning, Affective Computing, Speech Synthesis, Voice Conversion, Anti-spoofing