WenetSpeech-Chuan: A Large-Scale Sichuanese Corpus with Rich Annotation for Dialectal Speech Processing

📅 2025-09-22

🤖 AI Summary

Large-scale, open-source speech data is scarce for major Chinese dialects, Sichuanese in particular, and this scarcity holds back ASR and TTS development. This work introduces WenetSpeech-Chuan, a 10,000-hour corpus described as the largest open-source Sichuanese speech corpus to date. It is built with Chuan-Pipeline, a data processing framework for dialectal speech that integrates ASR-based pre-annotation, text-speech alignment, pronunciation variant modeling, and multi-stage human verification for data cleaning and fine-grained annotation. The authors also release WenetSpeech-Chuan-Eval, a set of ASR and TTS benchmarks with manually verified transcriptions. Models trained on the corpus achieve state-of-the-art performance among open-source systems and results comparable to commercial services. The authors further position the corpus as a step toward AI equity and bias mitigation in speech technology.

📝 Abstract

The scarcity of large-scale, open-source data for dialects severely hinders progress in speech technology, a challenge particularly acute for the widely spoken Sichuanese dialects of Chinese. To address this critical gap, we introduce WenetSpeech-Chuan, a 10,000-hour, richly annotated corpus constructed using our novel Chuan-Pipeline, a complete data processing framework for dialectal speech. To facilitate rigorous evaluation and demonstrate the corpus's effectiveness, we also release high-quality ASR and TTS benchmarks, WenetSpeech-Chuan-Eval, with manually verified transcriptions. Experiments show that models trained on WenetSpeech-Chuan achieve state-of-the-art performance among open-source systems and demonstrate results comparable to commercial services. As the largest open-source corpus for Sichuanese dialects, WenetSpeech-Chuan not only lowers the barrier to research in dialectal speech processing but also plays a crucial role in promoting AI equity and mitigating bias in speech technologies. The corpus, benchmarks, models, and recipes are publicly available on our project page.
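The abstract does not say how the ASR benchmark is scored. As a minimal sketch, assuming character error rate (CER) is used, as is common for Chinese ASR, hypotheses could be compared against the manually verified transcriptions as below. The function names `edit_distance` and `corpus_cer` are hypothetical and are not taken from the project's released code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_cer(pairs):
    """Corpus-level CER: total character edits / total reference characters.

    `pairs` is an iterable of (reference, hypothesis) strings. Spaces are
    dropped before scoring; this normalization is an assumption, not a
    documented rule of the benchmark.
    """
    total_edits = 0
    total_chars = 0
    for ref, hyp in pairs:
        ref_chars = ref.replace(" ", "")
        hyp_chars = hyp.replace(" ", "")
        total_edits += edit_distance(ref_chars, hyp_chars)
        total_chars += len(ref_chars)
    if total_chars == 0:
        raise ValueError("reference text is empty")
    return total_edits / total_chars
```

Aggregating edits over the whole test set, rather than averaging per-utterance rates, keeps short utterances from dominating the score.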

Problem

Research questions and friction points this paper is trying to address.

Addressing the scarcity of large-scale open-source data for Sichuanese dialects

Providing rich annotations and benchmarks for dialectal speech processing research

Lowering barriers and mitigating bias in speech technology for underrepresented dialects

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale Sichuanese corpus with rich annotation

Novel Chuan-Pipeline for dialectal speech processing

High-quality ASR and TTS benchmarks released
