Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation

📅 2025-01-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current text-to-speech (TTS) models are predominantly trained on audiobook data, which limits their ability to capture the spontaneity, speaker timbre diversity, and stylistic variety of natural conversational speech. To address this, we introduce Emilia, the first large-scale, multilingual dataset of real-world spontaneous speech (101k+ hours, 6 languages), and its extended version, Emilia-Large (216k+ hours). We further propose Emilia-Pipe, an open-source preprocessing pipeline for scalable construction of high-quality multilingual spontaneous speech data. The pipeline integrates robust voice cleaning and alignment, multilingual ASR with text normalization, acoustic quality assessment and filtering, and cross-lingual feature modeling. Models trained on the resulting data produce more natural and prosodically diverse synthetic speech. Empirical evaluation shows consistent and substantial improvements over audiobook-based baselines across multilingual and cross-lingual TTS benchmarks. The publicly released Emilia datasets aim to advance community research in spontaneous-speech synthesis.
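The quality assessment and filtering stage mentioned in the summary can be illustrated with a minimal sketch. This is not Emilia-Pipe's actual code: the `Segment` record, its field names, the ISO 639-1 language codes, and the threshold values are all hypothetical, chosen only to show how a language, quality, and duration filter over transcribed segments might look.

```python
from dataclasses import dataclass


# Hypothetical record for one transcribed speech segment. The field names
# are illustrative assumptions, not Emilia-Pipe's actual schema.
@dataclass
class Segment:
    language: str      # assumed ISO 639-1 code, e.g. "en" or "zh"
    duration_s: float  # segment length in seconds
    quality: float     # assumed acoustic quality score, higher is better
    text: str          # ASR transcript


# The six languages named in the abstract, written as assumed ISO 639-1 codes.
LANGUAGES = {"en", "zh", "de", "fr", "ja", "ko"}


def filter_segments(segments, min_quality=3.0, min_s=3.0, max_s=30.0):
    """Keep segments that pass the language, quality, duration, and text checks.

    All thresholds are placeholder assumptions for illustration.
    """
    return [
        s for s in segments
        if s.language in LANGUAGES
        and s.quality >= min_quality
        and min_s <= s.duration_s <= max_s
        and s.text.strip()  # drop segments with an empty transcript
    ]


def total_hours(segments):
    """Sum segment durations and convert seconds to hours."""
    return sum(s.duration_s for s in segments) / 3600.0
```

In a sketch like this, only the segments returned by `filter_segments` would count toward the dataset's reported hours, which is why a filter threshold directly trades dataset size against audio quality.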

📝 Abstract

Recent advancements in speech generation have been driven by large-scale training datasets. However, current models fall short of capturing the spontaneity and variability inherent in real-world human speech, due to their reliance on audiobook datasets limited to formal read-aloud speech styles. To bridge this gap, we introduce Emilia-Pipe, an open-source preprocessing pipeline that extracts high-quality training data from valuable yet underexplored in-the-wild data capturing spontaneous human speech in real-world contexts. Using Emilia-Pipe, we construct Emilia, the first multilingual speech generation dataset derived from in-the-wild speech data. This dataset comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. In addition, we expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it the largest open-source speech generation dataset available. Extensive experiments demonstrate that Emilia significantly outperforms traditional audiobook datasets in generating spontaneous and human-like speech, showing superior performance in capturing the diverse speaker timbres and speaking styles of real-world human speech. Furthermore, this work underscores the importance of scaling dataset size to advance speech generation research and validates the effectiveness of Emilia for both multilingual and cross-lingual speech generation.

Problem

Research questions and friction points this paper is trying to address.

Speech Synthesis

Naturalness

Diversity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Emilia-Pipe

multilingual speech synthesis

naturalistic speech data

🔎 Similar Papers

MAD Speech: Measures of Acoustic Diversity of Speech