🤖 AI Summary
This study investigates the generalization capability of speech foundation models for automatic speech recognition (ASR) in low-resource regional dialects. To address the lack of benchmark resources for Bangla dialect ASR, we construct and publicly release Ben-10—the first high-quality dialectal speech dataset for Bangla (78 hours, covering 10 regional variants). Experiments reveal that state-of-the-art foundation models suffer substantial performance degradation—both in zero-shot and fine-tuned settings—on out-of-distribution (OOD) dialects, exposing critical robustness limitations. In contrast, dialect-specific modeling significantly improves accuracy. Methodologically, we introduce a linguistics-informed data curation pipeline and an OOD evaluation framework tailored to dialectal variation. Our work is the first to systematically characterize the limitations of foundation models in low-resource dialect ASR. We contribute Ben-10, open-source code, and strong baselines, establishing a new benchmark for dialect-aware ASR research.
📝 Abstract
Conventional research on speech recognition modeling relies on the canonical form for most low-resource languages, while automatic speech recognition (ASR) for regional dialects is treated as a fine-tuning task. To investigate the effects of dialectal variation on ASR, we develop a 78-hour annotated Bengali Speech-to-Text (STT) corpus named Ben-10. Investigation from both linguistic and data-driven perspectives shows that speech foundation models struggle heavily with regional dialect ASR, in both zero-shot and fine-tuned settings. We observe that all deep learning methods struggle to model speech data under dialectal variation, but dialect-specific model training alleviates the issue. Our dataset also serves as an out-of-distribution (OOD) resource for ASR modeling under constrained resources. The dataset and code developed for this project are publicly available.
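The zero-shot vs. fine-tuned comparisons described above are typically scored with word error rate (WER). As a minimal sketch, the metric can be computed as a word-level edit distance; this is an illustrative implementation, not the paper's released evaluation code:

```python
# Hypothetical sketch: word error rate (WER), the standard metric for
# comparing ASR hypotheses against reference transcripts. The paper's
# exact evaluation pipeline is not reproduced here.

def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for substitutions, insertions, deletions.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

A degraded OOD dialect transcription would show up directly as a higher WER than the in-distribution baseline.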