AI Summary
Current text-to-speech (TTS) models are predominantly trained on audiobook data, which limits their ability to capture the spontaneity, speaker timbre diversity, and stylistic fidelity characteristic of natural conversational speech. To address this, we introduce Emilia, the first large-scale, multilingual dataset of real-world spontaneous speech (101k+ hours, 6 languages), and its extended version, Emilia-Large (216k+ hours). We further propose Emilia-Pipe, an open-source preprocessing pipeline that, for the first time, enables high-fidelity, scalable construction of multilingual spontaneous speech data. The pipeline integrates robust voice cleaning and alignment, multilingual ASR with text normalization, acoustic quality assessment and filtering, and cross-lingual feature modeling. Together, these contributions significantly improve the naturalness and prosodic diversity of synthetic speech. Empirical evaluation demonstrates consistent and substantial improvements over audiobook-based baselines across multilingual and cross-lingual TTS benchmarks. The publicly released Emilia datasets aim to advance community research in spontaneous-speech synthesis.
Abstract
Recent advancements in speech generation have been driven by large-scale training datasets. However, current models fall short of capturing the spontaneity and variability inherent in real-world human speech because they rely on audiobook datasets limited to formal, read-aloud speech styles. To bridge this gap, we introduce Emilia-Pipe, an open-source preprocessing pipeline that extracts high-quality training data from valuable yet underexplored in-the-wild data capturing spontaneous human speech in real-world contexts. Using Emilia-Pipe, we construct Emilia, the first multilingual speech generation dataset derived from in-the-wild speech data. The dataset comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. In addition, we expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it the largest open-source speech generation dataset available. Extensive experiments demonstrate that Emilia significantly outperforms traditional audiobook datasets in generating spontaneous and human-like speech, showing superior performance in capturing the diverse speaker timbres and speaking styles of real-world human speech. Furthermore, this work underscores the importance of scaling dataset size to advance speech generation research and validates the effectiveness of Emilia for both multilingual and cross-lingual speech generation.