The Role of Orthographic Consistency in Multilingual Embedding Models for Text Classification in Arabic-Script Languages

📅 2025-07-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multilingual models (e.g., mBERT, XLM-RoBERTa) exhibit limited performance on languages sharing the Arabic script—such as Arabic, Persian, Urdu, and Sorani Kurdish—due to divergent orthographic conventions and cultural norms. Method: We propose a script-aware language-specialized pretraining paradigm, introducing the AS-RoBERTa family: RoBERTa models pretrained on large monolingual corpora per language, using a unified Arabic-script tokenization and masking strategy. Contribution/Results: Through ablation studies and confusion matrix analysis, we elucidate how script-level commonalities and language-specific differences jointly influence text classification. Fine-tuned AS-RoBERTa achieves consistent improvements over strong baselines on multilingual text classification, with average gains of 2–5 percentage points. These results empirically validate the efficacy and cross-lingual transferability of orthography-consistent, script-specialized pretraining.
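To make the idea of orthography-consistent preprocessing concrete, here is a minimal illustrative sketch of unifying language-specific Arabic-script letter variants before tokenization. The codepoint mapping below is a small hypothetical example for illustration only; the paper does not publish its actual normalization table.

```python
# Illustrative sketch: map visually/functionally equivalent Arabic-script
# codepoints to shared forms, so Arabic, Persian, Urdu, and Sorani Kurdish
# text enters the tokenizer with one orthographic representation.
# NOTE: this mapping is a hypothetical example, not AS-RoBERTa's actual rules.
UNIFY = {
    "\u06CC": "\u064A",  # Persian/Urdu FARSI YEH  -> Arabic YEH
    "\u06A9": "\u0643",  # Persian/Urdu KEHEH      -> Arabic KAF
    "\u06D2": "\u064A",  # Urdu YEH BARREE         -> Arabic YEH (lossy)
}

def normalize_arabic_script(text: str) -> str:
    """Replace language-specific letter variants with shared codepoints."""
    return "".join(UNIFY.get(ch, ch) for ch in text)

# Persian spelling of "ketab" (book) uses KEHEH; normalization yields the
# Arabic-KAF form, so both spellings share one subword representation.
print(normalize_arabic_script("\u06A9\u062A\u0627\u0628"))
```

A unified tokenizer trained on text normalized this way assigns the same subword IDs to cognate strings across the four languages, which is one plausible mechanism behind the cross-lingual transfer the summary describes.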

📝 Abstract
In natural language processing, multilingual models like mBERT and XLM-RoBERTa promise broad coverage but often struggle with languages that share a script yet differ in orthographic norms and cultural context. This issue is especially notable in Arabic-script languages such as Kurdish Sorani, Arabic, Persian, and Urdu. We introduce the Arabic Script RoBERTa (AS-RoBERTa) family: four RoBERTa-based models, each pre-trained on a large corpus tailored to its specific language. By focusing pre-training on language-specific script features and statistics, our models capture patterns overlooked by general-purpose models. When fine-tuned on classification tasks, AS-RoBERTa variants outperform mBERT and XLM-RoBERTa by 2 to 5 percentage points. An ablation study confirms that script-focused pre-training is central to these gains. Error analysis using confusion matrices shows how shared script traits and domain-specific content affect performance. Our results highlight the value of script-aware specialization for languages using the Arabic script and support further work on pre-training strategies rooted in script and language specificity.
Problem

Research questions and friction points this paper is trying to address.

Multilingual models struggle with orthographic variation across Arabic-script languages
Text classification accuracy for Arabic-script languages needs improvement
Script-specific pre-training is underexplored for better language representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Arabic Script RoBERTa models for specific languages
Script-focused pre-training improves classification accuracy
Outperforms mBERT and XLM-RoBERTa by 2–5 percentage points