Towards Comprehensive Semantic Speech Embeddings for Chinese Dialects

📅 2026-01-12
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the gap in speech and language technology between Chinese dialects and Mandarin, attributed chiefly to the lack of speech representations that are semantically aligned across the dialects and Mandarin. The study trains a speech encoder exclusively on automatic speech recognition (ASR) data to achieve semantic alignment among multiple Chinese dialects and Mandarin. Key contributions include the first method for learning semantically aligned speech embeddings for Chinese dialects using only ASR data, the release of the first open-source benchmark supporting cross-dialect speech-to-speech retrieval, and empirical validation of the encoder on that benchmark. The encoder also achieves state-of-the-art performance on Chinese dialect ASR, substantially advancing the development of large speech models for dialectal Chinese.

Technology Category

Application Category

πŸ“ Abstract
Despite having hundreds of millions of speakers, Chinese dialects lag behind Mandarin in speech and language technologies. Most varieties are primarily spoken, making dialect-to-Mandarin speech-LLMs (large language models) more practical than dialect LLMs. Building dialect-to-Mandarin speech-LLMs requires speech representations with cross-dialect semantic alignment between Chinese dialects and Mandarin. In this paper, we achieve such a cross-dialect semantic alignment by training a speech encoder with ASR (automatic speech recognition)-only data, as demonstrated by speech-to-speech retrieval on a new benchmark of spoken Chinese varieties that we contribute. Our speech encoder further demonstrates state-of-the-art ASR performance on Chinese dialects. Together, our Chinese dialect benchmark, semantically aligned speech representations, and speech-to-speech retrieval evaluation lay the groundwork for future Chinese dialect speech-LLMs. We release the benchmark at https://github.com/kalvinchang/yubao.
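The abstract evaluates cross-dialect semantic alignment via speech-to-speech retrieval: a dialect utterance should retrieve its Mandarin counterpart from a gallery of embeddings. A minimal sketch of such an evaluation under cosine similarity is shown below; the random embeddings, dimensionality, and `recall_at_k` helper are illustrative stand-ins, not the paper's actual encoder or benchmark protocol.

```python
import numpy as np

def recall_at_k(query_embs, gallery_embs, k=1):
    """Recall@k for speech-to-speech retrieval with cosine similarity.

    Assumes parallel data: the correct gallery match for query i
    is gallery item i (e.g. a dialect utterance and its Mandarin
    counterpart share the same index).
    """
    # L2-normalize so the dot product equals cosine similarity.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = q @ g.T                                 # (num_queries, num_gallery)
    topk = np.argsort(-sims, axis=1)[:, :k]        # indices of k best matches
    hits = (topk == np.arange(len(q))[:, None]).any(axis=1)
    return hits.mean()

# Toy data: dialect embeddings are a noisy copy of Mandarin embeddings,
# mimicking a semantically aligned (but imperfect) encoder.
rng = np.random.default_rng(0)
mandarin = rng.normal(size=(100, 256))
dialect = mandarin + 0.1 * rng.normal(size=(100, 256))

print(f"recall@1: {recall_at_k(dialect, mandarin, k=1):.2f}")
```

Well-aligned embeddings drive recall@1 toward 1.0, while unaligned embeddings score near chance (1/gallery size), which is what makes retrieval a direct probe of semantic alignment.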
Problem

Research questions and friction points this paper is trying to address.

Chinese dialects
semantic alignment
speech representations
speech-LLMs
Mandarin
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-dialect semantic alignment
speech encoder
ASR-only training
Chinese dialects
speech-to-speech retrieval
Kalvin Chang
EECS Dept., UC Berkeley
Yiwen Shao
Johns Hopkins University
speech recognition, machine learning, deep learning, Natural Language Processing
Jiahong Li
Tencent AI Labs, Tencent
Dong Yu
Tencent AI Labs, Tencent