🤖 AI Summary
To address data scarcity in Arabic multi-dialect automatic speech recognition (ASR), this work conducts a systematic fine-tuning study of the Whisper architecture across five major dialects—Gulf, Levantine, Iraqi, Egyptian, and Maghrebi—as well as Modern Standard Arabic (MSA). We investigate three strategies: (1) balanced mixed-dialect training, (2) lightweight pre-fine-tuning on MSA, and (3) an empirical analysis of cross-variant feature transferability. Results show that the mixed-dialect model achieves performance comparable to dialect-specific models; that fine-tuning Whisper Tiny on only a small amount of MSA data surpasses the zero-shot performance of the larger Whisper Base; and that the analysis reveals limited representational overlap between MSA and the dialects, challenging prevailing assumptions about cross-variant transfer. This study establishes a reproducible, data-efficient paradigm for low-resource multi-dialect ASR, offering practical guidance for leveraging heterogeneous Arabic speech resources.
📝 Abstract
Although commercial Arabic automatic speech recognition (ASR) systems support Modern Standard Arabic (MSA), they struggle with dialectal speech. We investigate the effect of fine-tuning OpenAI's Whisper on five major Arabic dialects (Gulf, Levantine, Iraqi, Egyptian, Maghrebi) using Mozilla Common Voice for MSA and the MASC dataset for dialectal speech. We evaluate MSA training size effects, benefits of pre-training on MSA data, and dialect-specific versus dialect-pooled models. We find that small amounts of MSA fine-tuning data yield substantial improvements for smaller models, matching larger non-fine-tuned models. While MSA pre-training shows minimal benefit, suggesting limited shared features between MSA and dialects, our dialect-pooled models perform comparably to dialect-specific ones. This indicates that pooling dialectal data, when properly balanced, can help address data scarcity in low-resource ASR without significant performance loss.
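The abstract does not name its evaluation metric, but ASR comparisons like these are conventionally reported as word error rate (WER): the word-level edit distance between a reference transcript and the model's hypothesis, divided by the reference length. A minimal sketch, assuming standard WER is the metric used:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions only
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words:
print(wer("the cat sat on the mat", "the cat sit on mat"))  # 0.333...
```

In practice, libraries such as `jiwer` compute the same quantity; the hand-rolled version above just makes the edit-distance definition explicit.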