🤖 AI Summary
This work addresses the long-standing scarcity of efficient, dialect-specific self-supervised models for multi-dialectal Arabic speech processing. It presents the first self-supervised pre-training approach tailored to the diverse family of Arabic dialects, training Conformer models under the BEST-RQ framework on 5,640 hours of web-crawled and publicly available speech data. The resulting models achieve state-of-the-art dialect identification with fewer parameters than competing systems and significantly outperform general-purpose multilingual and non-Arabic monolingual models on automatic speech recognition. These results demonstrate the effectiveness of domain-targeted pre-training for low-resource, linguistically heterogeneous language varieties such as Arabic dialects.
📝 Abstract
We present Ara-BEST-RQ, a family of self-supervised learning (SSL) models specifically designed for multi-dialectal Arabic speech processing. Leveraging 5,640 hours of crawled Creative Commons speech combined with publicly available datasets, we pre-train Conformer-based BEST-RQ models with up to 600M parameters. Our models are evaluated on dialect identification (DID) and automatic speech recognition (ASR) tasks, achieving state-of-the-art performance on the former while using fewer parameters than competing models. We demonstrate that family-targeted pre-training on Arabic dialects significantly improves downstream performance compared to multilingual or monolingual models trained on non-Arabic data. All models, code, and pre-processed datasets will be publicly released to support reproducibility and further research in Arabic speech technologies.
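For readers unfamiliar with BEST-RQ, its core idea is that pre-training targets come from a frozen random-projection quantizer: each speech frame is projected by a fixed random matrix and assigned the index of its nearest vector in a fixed random codebook, and the Conformer is trained to predict these indices for masked frames. The following NumPy sketch illustrates that target-generation step only; the dimensions, codebook size, and function names are illustrative assumptions, not the authors' configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

FEATURE_DIM = 80      # e.g. log-mel filterbank dimension (assumed)
PROJ_DIM = 16         # projection dimension (illustrative)
CODEBOOK_SIZE = 8192  # number of discrete targets (illustrative)

# In BEST-RQ both the projection matrix and the codebook are randomly
# initialized and kept frozen; neither receives gradient updates.
projection = rng.normal(size=(FEATURE_DIM, PROJ_DIM))
codebook = rng.normal(size=(CODEBOOK_SIZE, PROJ_DIM))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def quantize(frames: np.ndarray) -> np.ndarray:
    """Map (T, FEATURE_DIM) speech frames to (T,) discrete target indices."""
    projected = frames @ projection
    projected /= np.linalg.norm(projected, axis=1, keepdims=True)
    # Nearest codebook entry by cosine similarity (vectors are unit-norm).
    return np.argmax(projected @ codebook.T, axis=1)

# Example: 100 frames of random "speech" features yield 100 target indices,
# which the masked Conformer encoder would be trained to predict.
targets = quantize(rng.normal(size=(100, FEATURE_DIM)))
```

Because the quantizer is frozen, the targets are cheap to compute and deterministic for a given input, which is what makes BEST-RQ attractive for large-scale pre-training on heterogeneous crawled speech.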