Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation

πŸ“… 2025-01-27
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Current text-to-speech (TTS) models are predominantly trained on audiobook data, limiting their ability to capture the spontaneity, speaker timbre diversity, and stylistic fidelity characteristic of natural conversational speech. To address this, we introduce Emiliaβ€”the first large-scale, multilingual dataset of real-world spontaneous speech (101k+ hours, 6 languages)β€”and its extended version Emilia-Large (216k+ hours). We further propose Emilia-Pipe, an open-source preprocessing pipeline enabling high-fidelity, scalable construction of multilingual spontaneous speech data for the first time. Our pipeline integrates robust voice cleaning and alignment, multilingual ASR with text normalization, acoustic quality assessment and filtering, and cross-lingual feature modeling. These innovations significantly improve synthetic speech naturalness and prosodic diversity. Empirical evaluation demonstrates consistent and substantial improvements over audiobook-based baselines across multilingual and cross-lingual TTS benchmarks. The publicly released Emilia datasets aim to advance community research in spontaneous-speech synthesis.
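The summary above describes Emilia-Pipe as a sequence of stages: voice cleaning and alignment, multilingual ASR with text normalization, and acoustic quality assessment and filtering. As a rough illustration of how such stages might compose, here is a minimal sketch; every function, field, and threshold below is a hypothetical stand-in, not Emilia-Pipe's actual API (the real pipeline is open source).

```python
# Illustrative sketch only: stage names paraphrase the summary above
# (cleaning/segmentation, ASR + normalization, quality filtering);
# all logic here is a simplified placeholder, not Emilia-Pipe itself.
from dataclasses import dataclass

@dataclass
class Clip:
    audio_id: str
    duration_s: float
    snr_db: float          # hypothetical signal-quality estimate
    transcript: str = ""
    keep: bool = True

def clean(clip: Clip) -> Clip:
    # Stand-in for VAD-based segmentation: drop very short clips.
    clip.keep = clip.keep and clip.duration_s >= 3.0
    return clip

def transcribe(clip: Clip) -> Clip:
    # Stand-in for multilingual ASR plus text normalization.
    clip.transcript = clip.transcript.strip().lower()
    return clip

def quality_filter(clip: Clip, min_snr_db: float = 20.0) -> Clip:
    # Stand-in for acoustic quality assessment and filtering.
    clip.keep = clip.keep and clip.snr_db >= min_snr_db
    return clip

def run_pipeline(clips):
    # Apply each stage in order, then keep only clips that pass all filters.
    for clip in clips:
        for stage in (clean, transcribe, quality_filter):
            clip = stage(clip)
    return [c for c in clips if c.keep]
```

The point of the sketch is the shape of the problem: each stage either enriches a clip (transcript) or gates it (duration, quality), and only clips surviving every gate enter the final dataset.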


πŸ“ Abstract
Recent advancements in speech generation have been driven by large-scale training datasets. However, current models fall short of capturing the spontaneity and variability inherent in real-world human speech because they rely on audiobook datasets limited to formal, read-aloud speech styles. To bridge this gap, we introduce Emilia-Pipe, an open-source preprocessing pipeline that extracts high-quality training data from valuable yet underexplored in-the-wild recordings of spontaneous human speech in real-world contexts. Using Emilia-Pipe, we construct Emilia, the first multilingual speech generation dataset derived from in-the-wild speech data. The dataset comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. We further expand Emilia into Emilia-Large, a dataset exceeding 216k hours, making it the largest open-source speech generation dataset available. Extensive experiments demonstrate that Emilia significantly outperforms traditional audiobook datasets in generating spontaneous and human-like speech, showing superior performance in capturing the diverse speaker timbres and speaking styles of real-world human speech. This work also underscores the importance of scaling dataset size for speech generation research and validates the effectiveness of Emilia for both multilingual and cross-lingual speech generation.
Problem

Research questions and friction points this paper is trying to address.

Speech Synthesis
Naturalness
Diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Emilia-Pipe
multilingual speech synthesis
naturalistic speech data
πŸ‘₯ Authors
Haorui He
Chinese University of Hong Kong, Shenzhen, China

Zengqiang Shang
Institute of Acoustics, Chinese Academy of Sciences
Speech

Chaoren Wang
The Chinese University of Hong Kong, Shenzhen
Spoken Language Processing Β· LLM

Xuyuan Li
Laboratory of Speech and Intelligent Information Processing, Institute of Acoustics, CAS, Beijing, China; University of Chinese Academy of Sciences, Beijing, China

Yicheng Gu
Aalto University
Speech and Singing Voice Synthesis Β· Audio-Visual Generation Β· Digital Audio Effects

Hua Hua
Tencent
Deep Learning Β· Qualitative Spatial Reasoning

Liwei Liu
Shenzhen University
Biophotonics

Chen Yang
Laboratory of Speech and Intelligent Information Processing, Institute of Acoustics, CAS, Beijing, China; University of Chinese Academy of Sciences, Beijing, China

Jiaqi Li
Chinese University of Hong Kong, Shenzhen, China

Peiyang Shi
Laboratory of Speech and Intelligent Information Processing, Institute of Acoustics, CAS, Beijing, China

Yuancheng Wang
The Chinese University of Hong Kong, Shenzhen
Deep Learning Β· Speech Synthesis Β· Music Generation Β· Audio Generation

Kai Chen
Shanghai AI Laboratory, Shanghai, China

Pengyuan Zhang
Laboratory of Speech and Intelligent Information Processing, Institute of Acoustics, CAS, Beijing, China; University of Chinese Academy of Sciences, Beijing, China

Zhizheng Wu
The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Mel Lab
Spoken Language Processing Β· DeepFake detection Β· Music Processing