Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages?

📅 2025-10-27
🤖 AI Summary
This study investigates how well speech foundation models generalize to automatic speech recognition (ASR) for low-resource regional dialects. To address the lack of benchmark resources for Bangla dialect ASR, we construct and publicly release Ben-10, the first high-quality dialectal speech dataset for Bangla (78 hours, covering 10 regional variants). Experiments reveal that state-of-the-art foundation models suffer substantial performance degradation on out-of-distribution (OOD) dialects in both zero-shot and fine-tuned settings, exposing critical robustness limitations. In contrast, dialect-specific modeling significantly improves accuracy. Methodologically, we introduce a linguistics-informed data curation pipeline and an OOD evaluation framework tailored to dialectal variation. Our work is the first to systematically characterize the limitations of foundation models in low-resource dialect ASR. We contribute Ben-10, open-source code, and strong baselines, establishing a new benchmark for dialect-aware ASR research.

📝 Abstract
Conventional research on speech recognition modeling relies on the canonical form for most low-resource languages, while automatic speech recognition (ASR) for regional dialects is treated as a fine-tuning task. To investigate the effects of dialectal variation on ASR, we develop Ben-10, a 78-hour annotated Bengali speech-to-text (STT) corpus. Investigation from linguistic and data-driven perspectives shows that speech foundation models struggle heavily with regional dialect ASR in both zero-shot and fine-tuned settings. We observe that all deep learning methods struggle to model speech under dialectal variation, but dialect-specific model training alleviates the issue. Our dataset also serves as an out-of-distribution (OOD) resource for ASR modeling under resource constraints. The dataset and code developed for this project are publicly available.
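One way to make the zero-shot versus fine-tuned comparison concrete is to score transcripts separately for each dialect. The sketch below assumes word error rate (WER) as the metric, which the abstract does not name, and the function names and the `(dialect, reference, hypothesis)` record layout are illustrative choices, not taken from the project's code.

```python
from collections import defaultdict


def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_dialect(samples):
    """Aggregate WER per dialect.

    samples: iterable of (dialect, reference, hypothesis) string triples.
    Returns {dialect: total word errors / total reference words}.
    """
    errors = defaultdict(int)
    ref_lengths = defaultdict(int)
    for dialect, reference, hypothesis in samples:
        ref_words = reference.split()
        errors[dialect] += word_edit_distance(ref_words, hypothesis.split())
        ref_lengths[dialect] += len(ref_words)
    # Skip dialects whose references are all empty to avoid dividing by zero.
    return {d: errors[d] / n for d, n in ref_lengths.items() if n > 0}
```

Running `wer_by_dialect` once on a model's zero-shot output and once on its fine-tuned output would give two per-dialect tables that can be compared directly. Whitespace tokenization is a simplification here; real Bengali text would likely need normalization first.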
Problem

Research questions and friction points this paper is trying to address.

Evaluating ASR foundation models' generalization for low-resource dialect recognition
Investigating dialectal variation effects on speech recognition performance
Addressing limited annotated data for regional dialect ASR modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed 78-hour Bengali dialect speech corpus
Evaluated foundation models on dialect recognition performance
Proposed dialect-specific training to improve ASR accuracy
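The contrast between out-of-distribution evaluation and dialect-specific training can be illustrated with a leave-one-dialect-out split. This is a minimal sketch under the assumption that each utterance carries a dialect label; the source does not describe the actual split protocol, and the function name, record layout, and labels are hypothetical.

```python
def leave_one_dialect_out(utterances, held_out):
    """Split utterances into a training set and an OOD test set.

    utterances: list of (dialect, utterance_id) pairs (hypothetical layout).
    held_out:   the dialect label to withhold from training.
    """
    train = [u for u in utterances if u[0] != held_out]
    test = [u for u in utterances if u[0] == held_out]
    if not test:
        raise ValueError(f"no utterances found for dialect {held_out!r}")
    return train, test


def all_ood_splits(utterances):
    """Yield (held_out_dialect, train, test) for every dialect in the data."""
    for dialect in sorted({u[0] for u in utterances}):
        train, test = leave_one_dialect_out(utterances, dialect)
        yield dialect, train, test
```

A model trained on `train` and scored on `test` has never seen the held-out dialect, whereas a dialect-specific model would be trained on that dialect's own data.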
Authors
Tawsif Tashwar Dipto
Islamic University of Technology
Computer Vision, Natural Language Processing, Low Resource Languages, Automatic Speech Recognition
Azmol Hossain
Bengali.AI
Rubayet Sabbir Faruque
Brac University, Bangladesh
Md. Rezuwan Hassan
Brac University, Bangladesh
Kanij Fatema
Brac University, Bangladesh
Tanmoy Shome
Brac University, Bangladesh
Ruwad Naswan
Bengali.AI
Md. Foriduzzaman Zihad
Bengali.AI
Mohaymen Ul Anam
Islamic University of Technology, Bangladesh
Nazia Tasnim
Boston University
PEFT & Model Editing, Computer Vision, Explainable AI, Multimodal Systems
Hasan Mahmud
Postdoctoral Research Associate, Rochester Institute of Technology
Information Systems, Algorithmic decision-making, HCI/Human-AI interaction
Md. Kamrul Hasan
Islamic University of Technology, Bangladesh
Md. Mehedi Hasan Shawon
Lecturer, BSRM School of Engineering, BRAC University
Health Informatics, Medical Imaging, Data Science, Artificial Intelligence, Explainable AI
Farig Sadeque
Associate Professor, BRAC University
Natural Language Processing, Computational Social Science
Tahsin Reasat
Bengali.AI