WenetSpeech-Yue: A Large-scale Cantonese Speech Corpus with Multi-dimensional Annotation

📅 2025-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Cantonese, spoken natively by approximately 84.9 million people, has long suffered from a severe scarcity of high-quality annotated speech data, hindering progress in automatic speech recognition (ASR) and text-to-speech (TTS). To address this, we introduce WenetSpeech-Yue, the first large-scale, multi-dimensionally annotated Cantonese speech corpus, comprising 21,800 hours of audio across ten domains. We propose WenetSpeech-Pipe, an integrated annotation pipeline enabling concurrent labeling of speech quality, speaker attributes, and fine-grained phoneme- and tone-aware transcripts. Additionally, we release WSYue-eval, the first comprehensive Cantonese evaluation benchmark. Annotation quality is rigorously ensured through a four-stage process: ASR-based pre-screening, rule- and model-guided text post-processing, multi-model consensus voting, and expert human verification. ASR and TTS models trained on WenetSpeech-Yue achieve results competitive with state-of-the-art systems, including leading commercial and LLM-based models, while demonstrating strong robustness across diverse real-world scenarios.
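The multi-model consensus voting stage described above (the Recognizer Output Voting module of WenetSpeech-Pipe) can be sketched as a token-level majority vote over multiple ASR hypotheses. The function name, the pre-alignment assumption, and the Jyutping example tokens below are illustrative, not taken from the paper:

```python
from collections import Counter

def consensus_vote(hypotheses):
    """Token-level majority vote over aligned ASR hypotheses.

    Assumes the hypotheses have already been aligned to equal length
    (e.g. by a ROVER-style alignment); "" marks an alignment gap.
    """
    if not hypotheses:
        return ""
    length = len(hypotheses[0])
    assert all(len(h) == length for h in hypotheses), "hypotheses must be aligned"
    voted = []
    for i in range(length):
        # Most frequent token at this aligned position wins the vote.
        token, _count = Counter(h[i] for h in hypotheses).most_common(1)[0]
        if token:  # drop positions where the consensus is a gap
            voted.append(token)
    return " ".join(voted)

# Three hypothetical recognizer outputs for one utterance (Jyutping tokens)
hyps = [
    ["nei5", "hou2", "aa3"],
    ["nei5", "hou2", "aa3"],
    ["lei5", "hou2", ""],
]
print(consensus_vote(hyps))  # → nei5 hou2 aa3
```

In the actual pipeline, positions with low inter-recognizer agreement would additionally lower the text-confidence score attached to the transcript, flagging the utterance for the later human-verification stage.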

📝 Abstract
The development of speech understanding and generation has been significantly accelerated by the availability of large-scale, high-quality speech datasets. Among these tasks, ASR and TTS are regarded as the most established and fundamental. However, for Cantonese (Yue Chinese), spoken by approximately 84.9 million native speakers worldwide, limited annotated resources have hindered progress and resulted in suboptimal ASR and TTS performance. To address this challenge, we propose WenetSpeech-Pipe, an integrated pipeline for building large-scale speech corpora with multi-dimensional annotation tailored for speech understanding and generation. It comprises six modules: Audio Collection, Speaker Attributes Annotation, Speech Quality Annotation, Automatic Speech Recognition, Text Postprocessing, and Recognizer Output Voting, enabling rich and high-quality annotations. Based on this pipeline, we release WenetSpeech-Yue, the first large-scale Cantonese speech corpus with multi-dimensional annotation for ASR and TTS, covering 21,800 hours across 10 domains with annotations including ASR transcription, text confidence, speaker identity, age, gender, and speech quality scores, among others. We also release WSYue-eval, a comprehensive Cantonese benchmark with two components: WSYue-ASR-eval, a manually annotated set for evaluating ASR on short and long utterances, code-switching, and diverse acoustic conditions, and WSYue-TTS-eval, with base and coverage subsets for standard and generalization testing. Experimental results show that models trained on WenetSpeech-Yue achieve competitive results against state-of-the-art (SOTA) Cantonese ASR and TTS systems, including commercial and LLM-based models, highlighting the value of our dataset and pipeline.
Problem

Research questions and friction points this paper is trying to address.

Limited annotated Cantonese speech resources hinder ASR and TTS progress
Lack of large-scale multi-dimensional annotated corpus for Cantonese processing
Suboptimal performance in Cantonese speech recognition and synthesis systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrated pipeline for multi-dimensional speech annotation
First large-scale Cantonese corpus with rich annotations
Comprehensive benchmark for ASR and TTS evaluation
Longhao Li
Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
Zhao Guo
Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
Hongjie Chen
Institute of Artificial Intelligence (TeleAI), China Telecom
Yuhang Dai
Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
Ziyu Zhang
Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
Hongfei Xue
Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
Tianlun Zuo
Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
Chengyou Wang
Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
Shuiyuan Wang
Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
Jie Li
Institute of Artificial Intelligence (TeleAI), China Telecom
Xin Xu
Beijing AISHELL Technology Co., Ltd.
Hui Bu
Beijing AISHELL Technology Co., Ltd.
Binbin Zhang
WeNet Open Source Community
Ruibin Yuan
Hong Kong University of Science and Technology
Ziya Zhou
The Hong Kong University of Science and Technology
Wei Xue
Hong Kong University of Science and Technology
Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University