Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe

📅 2025-08-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of a unified benchmark for modeling global dialects and regional languages by introducing Voxlect, the first large-scale multilingual dialectal speech foundation model benchmark. Voxlect encompasses over 30 language varieties—including English, Arabic, and Mandarin—and integrates more than two million utterances. Methodologically, it evaluates state-of-the-art speech foundation models on dialect classification and noise robustness, and extends evaluation to ASR data augmentation and TTS system assessment. Key contributions include: (1) proposing the first geographically aware multilingual dialect benchmark, revealing how dialectal geographic continuity influences representation learning; (2) demonstrating state-of-the-art performance in cross-lingual dialect classification and strong robustness under noisy conditions; and (3) empirically validating its generalizability to downstream tasks—significantly improving dialect identification accuracy and enhancing the reliability of speech generation evaluation.

📝 Abstract
We present Voxlect, a novel benchmark for modeling dialects and regional languages worldwide using speech foundation models. Specifically, we report comprehensive benchmark evaluations on dialects and regional language varieties in English, Arabic, Mandarin and Cantonese, Tibetan, Indic languages, Thai, Spanish, French, German, Brazilian Portuguese, and Italian. Our study used over 2 million training utterances from 30 publicly available speech corpora that are provided with dialectal information. We evaluate the performance of several widely used speech foundation models in classifying speech dialects. We assess the robustness of the dialectal models under noisy conditions and present an error analysis that highlights modeling results aligned with geographic continuity. In addition to benchmarking dialect classification, we demonstrate several downstream applications enabled by Voxlect. Specifically, we show that Voxlect can be applied to augment existing speech recognition datasets with dialect information, enabling a more detailed analysis of ASR performance across dialectal variations. Voxlect is also used as a tool to evaluate the performance of speech generation systems. Voxlect is publicly available under the RAIL family of licenses at: https://github.com/tiantiaf0627/voxlect.
Problem

Research questions and friction points this paper is trying to address.

Benchmarking speech foundation models for dialect classification
Evaluating dialect model robustness in noisy conditions
Enabling downstream applications like ASR and speech generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for dialects using speech foundation models
Evaluates models on diverse languages and dialects
Augments ASR datasets with dialect information
Tiantian Feng
Postdoc Researcher
Health and Behaviors · Wearable Computing · Affective Computing · Speech and Biosignal · Responsible ML
Kevin Huang
University of Southern California, Los Angeles, CA, USA
Anfeng Xu
University of Southern California, Los Angeles, CA, USA
Speech Processing · Multimodal AI · LLM · Deep Learning
Xuan Shi
University of Southern California, Los Angeles, CA, USA
Thanathai Lertpetchpun
University of Southern California, Los Angeles, CA, USA
Jihwan Lee
University of Southern California, Los Angeles, CA, USA
Yoonjeong Lee
University of Southern California, Los Angeles, CA, USA
Dani Byrd
University of Southern California, Los Angeles, CA, USA
Shrikanth Narayanan
University of Southern California, Los Angeles, CA, USA