🤖 AI Summary
Large-scale, open-source speech data is scarce for mainstream Chinese dialects, particularly Sichuanese, and this scarcity severely hinders ASR and TTS development. To address this bottleneck, this work introduces WenetSpeech-Chuan, the largest publicly available Sichuanese speech corpus to date (10,000 hours). We propose Chuan-Pipeline, an end-to-end dialect data processing framework integrating ASR-based pre-annotation, text-speech alignment, pronunciation variant modeling, and multi-stage human verification to enable efficient data cleaning and fine-grained annotation. Leveraging this corpus, we release WenetSpeech-Chuan-Eval, a set of ASR and TTS benchmarks with manually verified transcriptions, substantially lowering barriers to dialect speech research. Models trained on WenetSpeech-Chuan achieve state-of-the-art performance among open-source systems and results comparable to commercial services. By covering an under-resourced dialect at scale, the corpus also supports AI equity and bias mitigation in speech technologies.
📝 Abstract
The scarcity of large-scale, open-source data for dialects severely hinders progress in speech technology, a challenge particularly acute for the widely spoken Sichuanese dialects of Chinese. To address this critical gap, we introduce WenetSpeech-Chuan, a 10,000-hour, richly annotated corpus constructed using our novel Chuan-Pipeline, a complete data processing framework for dialectal speech. To facilitate rigorous evaluation and demonstrate the corpus's effectiveness, we also release high-quality ASR and TTS benchmarks, WenetSpeech-Chuan-Eval, with manually verified transcriptions. Experiments show that models trained on WenetSpeech-Chuan achieve state-of-the-art performance among open-source systems and demonstrate results comparable to commercial services. As the largest open-source corpus for Sichuanese dialects, WenetSpeech-Chuan not only lowers the barrier to research in dialectal speech processing but also plays a crucial role in promoting AI equity and mitigating bias in speech technologies. The corpus, benchmarks, models, and recipes are publicly available on our project page.