🤖 AI Summary
Arabic automatic speech recognition (ASR) faces challenges including linguistic complexity, a scarcity of open-source models, and insufficient dialectal coverage; existing work predominantly targets Modern Standard Arabic (MSA), neglecting Classical Arabic (CA) and multi-dialect joint modeling. This paper introduces the first open-source, end-to-end ASR model to jointly support MSA and CA, built on the FastConformer architecture. The approach integrates large-scale data preprocessing, multi-task learning, and phoneme-aware training. On standard MSA benchmarks, the model achieves state-of-the-art (SOTA) performance; on diacritized CA recognition, a previously unaddressed task, it sets the first SOTA accuracy while maintaining strong generalization to MSA. The complete model and training framework are publicly released, providing a scalable foundation for multi-dialect Arabic speech understanding.
📝 Abstract
Despite Arabic being one of the most widely spoken languages, the development of Arabic Automatic Speech Recognition (ASR) systems faces significant challenges due to the language's complexity, and only a limited number of public Arabic ASR models exist. While much of the focus has been on Modern Standard Arabic (MSA), considerably less attention has been given to the variations within the language. This paper introduces a universal methodology for Arabic speech and text processing designed to address the unique challenges of the language. Using this methodology, we train two novel models based on the FastConformer architecture: one designed specifically for MSA and the other, the first unified public model for both MSA and Classical Arabic (CA). The MSA model sets a new benchmark with state-of-the-art (SOTA) performance on related datasets, while the unified model achieves SOTA accuracy on diacritized CA while maintaining strong performance on MSA. To promote reproducibility, we open-source the models and their training recipes.