🤖 AI Summary
This work addresses the underdevelopment of robust speech technologies for low-resource dialects such as Wu Chinese, which has been hindered by the scarcity of large-scale datasets, standardized evaluation benchmarks, and open-source models. To bridge this gap, we present WenetSpeech-Wu, the first large-scale, multi-dimensionally annotated open-source Wu Chinese speech corpus, comprising approximately 8,000 hours of audio. Building upon this resource, we introduce WenetSpeech-Wu-Bench, the first standardized multi-task benchmark for Wu Chinese, encompassing six core tasks: automatic speech recognition (ASR), Wu-to-Mandarin translation, speaker attribute prediction, speech emotion recognition, text-to-speech (TTS) synthesis, and instruction-following TTS. We also release a suite of strong open-source baseline models trained on this corpus, significantly advancing the Wu Chinese speech processing ecosystem and establishing the first systematic foundation for research in this domain.
📝 Abstract
Speech processing for low-resource dialects remains a fundamental challenge in developing inclusive and robust speech technologies. Despite its linguistic significance and large speaker population, research on the Wu dialect of Chinese has long been hindered by the lack of large-scale speech data, standardized evaluation benchmarks, and publicly available models. In this work, we present WenetSpeech-Wu, the first large-scale, multi-dimensionally annotated open-source speech corpus for the Wu dialect, comprising approximately 8,000 hours of diverse speech data. Building upon this dataset, we introduce WenetSpeech-Wu-Bench, the first standardized and publicly accessible benchmark for systematic evaluation of Wu dialect speech processing, covering automatic speech recognition (ASR), Wu-to-Mandarin translation, speaker attribute prediction, speech emotion recognition, text-to-speech (TTS) synthesis, and instruction-following TTS (instruct TTS). Furthermore, we release a suite of strong open-source models trained on WenetSpeech-Wu, establishing competitive performance across multiple tasks and empirically validating the effectiveness of the proposed dataset. Together, these contributions lay the foundation for a comprehensive Wu dialect speech processing ecosystem, and we open-source the proposed datasets, benchmarks, and models to support future research on dialectal speech intelligence.