🤖 AI Summary
This study investigates the generalization capability of speech foundation models for automatic speech recognition (ASR) in low-resource regional dialects. To address the lack of benchmark resources for Bangla dialect ASR, we construct and publicly release Ben-10—the first high-quality dialectal speech dataset for Bangla (78 hours, covering 10 regional variants). Experiments reveal that state-of-the-art foundation models suffer substantial performance degradation—both in zero-shot and fine-tuned settings—on out-of-distribution (OOD) dialects, exposing critical robustness limitations. In contrast, dialect-specific modeling significantly improves accuracy. Methodologically, we introduce a linguistics-informed data curation pipeline and an OOD evaluation framework tailored to dialectal variation. Our work is the first to systematically characterize the limitations of foundation models in low-resource dialect ASR. We contribute Ben-10, open-source code, and strong baselines, establishing a new benchmark for dialect-aware ASR research.
📝 Abstract
Conventional research on speech recognition modeling relies on the canonical form for most low-resource languages, while automatic speech recognition (ASR) for regional dialects is treated as a fine-tuning task. To investigate the effects of dialectal variation on ASR, we develop a 78-hour annotated Bengali Speech-to-Text (STT) corpus named Ben-10. Investigation from both linguistic and data-driven perspectives shows that speech foundation models struggle heavily with regional dialect ASR, in both zero-shot and fine-tuned settings. We observe that all deep learning methods struggle to model speech data under dialectal variation, but dialect-specific model training alleviates the issue. Our dataset also serves as an out-of-distribution (OOD) resource for ASR modeling under constrained resources. The dataset and code developed for this project are publicly available.
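The zero-shot vs. fine-tuned comparisons described above are typically scored with word error rate (WER). As a minimal sketch, the metric can be computed as a word-level edit distance; this is an illustrative implementation, not the paper's released evaluation code:

```python
# Hypothetical sketch: word error rate (WER), the standard metric for
# comparing ASR hypotheses against reference transcripts. The paper's
# exact evaluation pipeline is not reproduced here.

def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for substitutions, insertions, deletions.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

A degraded OOD dialect transcription would show up directly as a higher WER than the in-distribution baseline.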