WenetSpeech-Yue: A Large-scale Cantonese Speech Corpus with Multi-dimensional Annotation

📅 2025-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Cantonese, spoken natively by approximately 84.9 million people, has long suffered from a severe scarcity of high-quality annotated speech data, hindering progress in automatic speech recognition (ASR) and text-to-speech (TTS). To address this, we introduce WenetSpeech-Yue, the first large-scale, multi-dimensionally annotated Cantonese speech corpus, comprising 21,800 hours of audio across ten domains. We propose WenetSpeech-Pipe, an integrated annotation pipeline enabling concurrent labeling of speech quality, speaker attributes, and fine-grained phoneme- and tone-aware transcripts. Additionally, we release WSYue-eval, the first comprehensive Cantonese evaluation benchmark. Annotation quality is rigorously ensured through a four-stage process: ASR-based pre-screening, rule- and model-guided text post-processing, multi-model consensus voting, and expert human verification. ASR and TTS models trained on WenetSpeech-Yue achieve results competitive with state-of-the-art systems, including leading commercial and LLM-based models, while demonstrating strong robustness across diverse real-world scenarios.
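The multi-model consensus voting stage described above (the Recognizer Output Voting module of WenetSpeech-Pipe) can be sketched as a token-level majority vote over multiple ASR hypotheses. The function name, the pre-alignment assumption, and the Jyutping example tokens below are illustrative, not taken from the paper:

```python
from collections import Counter

def consensus_vote(hypotheses):
    """Token-level majority vote over aligned ASR hypotheses.

    Assumes the hypotheses have already been aligned to equal length
    (e.g. by a ROVER-style alignment); "" marks an alignment gap.
    """
    if not hypotheses:
        return ""
    length = len(hypotheses[0])
    assert all(len(h) == length for h in hypotheses), "hypotheses must be aligned"
    voted = []
    for i in range(length):
        # Most frequent token at this aligned position wins the vote.
        token, _count = Counter(h[i] for h in hypotheses).most_common(1)[0]
        if token:  # drop positions where the consensus is a gap
            voted.append(token)
    return " ".join(voted)

# Three hypothetical recognizer outputs for one utterance (Jyutping tokens)
hyps = [
    ["nei5", "hou2", "aa3"],
    ["nei5", "hou2", "aa3"],
    ["lei5", "hou2", ""],
]
print(consensus_vote(hyps))  # → nei5 hou2 aa3
```

In the actual pipeline, positions with low inter-recognizer agreement would additionally lower the text-confidence score attached to the transcript, flagging the utterance for the later human-verification stage.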

📝 Abstract
The development of speech understanding and generation has been significantly accelerated by the availability of large-scale, high-quality speech datasets. Among these tasks, ASR and TTS are regarded as the most established and fundamental. However, for Cantonese (Yue Chinese), spoken by approximately 84.9 million native speakers worldwide, limited annotated resources have hindered progress and resulted in suboptimal ASR and TTS performance. To address this challenge, we propose WenetSpeech-Pipe, an integrated pipeline for building large-scale speech corpora with multi-dimensional annotation tailored for speech understanding and generation. It comprises six modules: Audio Collection, Speaker Attributes Annotation, Speech Quality Annotation, Automatic Speech Recognition, Text Postprocessing, and Recognizer Output Voting, enabling rich and high-quality annotations. Based on this pipeline, we release WenetSpeech-Yue, the first large-scale Cantonese speech corpus with multi-dimensional annotation for ASR and TTS, covering 21,800 hours across 10 domains with annotations including ASR transcription, text confidence, speaker identity, age, gender, and speech quality scores, among others. We also release WSYue-eval, a comprehensive Cantonese benchmark with two components: WSYue-ASR-eval, a manually annotated set for evaluating ASR on short and long utterances, code-switching, and diverse acoustic conditions, and WSYue-TTS-eval, with base and coverage subsets for standard and generalization testing. Experimental results show that models trained on WenetSpeech-Yue achieve competitive results against state-of-the-art (SOTA) Cantonese ASR and TTS systems, including commercial and LLM-based models, highlighting the value of our dataset and pipeline.
Problem

Research questions and friction points this paper is trying to address.

Limited annotated Cantonese speech resources hinder ASR and TTS progress
Lack of large-scale multi-dimensional annotated corpus for Cantonese processing
Suboptimal performance in Cantonese speech recognition and synthesis systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrated pipeline for multi-dimensional speech annotation
First large-scale Cantonese corpus with rich annotations
Comprehensive benchmark for ASR and TTS evaluation
Longhao Li
Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
Zhao Guo
Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
Hongjie Chen
Institute of Artificial Intelligence (TeleAI), China Telecom
Yuhang Dai
Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
Ziyu Zhang
Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
Hongfei Xue
Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
Tianlun Zuo
Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
Chengyou Wang
Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
Shuiyuan Wang
Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
Jie Li
Institute of Artificial Intelligence (TeleAI), China Telecom
Xin Xu
Beijing AISHELL Technology Co., Ltd.
Hui Bu
Beijing AISHELL Technology Co., Ltd.
Binbin Zhang
WeNet Open Source Community
Ruibin Yuan
Hong Kong University of Science and Technology
Ziya Zhou
The Hong Kong University of Science and Technology
Wei Xue
Hong Kong University of Science and Technology
Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University