🤖 AI Summary
Existing large language models (LLMs) exhibit significant performance gaps and sociolinguistic biases when modeling Dialectal Arabic (DA), yet comprehensive, multidimensional evaluation frameworks for DA remain absent. Method: We systematically assess nine state-of-the-art LLMs across eight major Arabic dialect varieties using a novel, operationalized DA evaluation framework spanning fidelity, comprehension, generation quality, and diglossic register transfer, combining human-curated benchmarking, controlled prompt engineering, cross-dialect comparative analysis, and post-training bias attribution. Contribution/Results: We identify pronounced generative suppression of DA that stems not from inherent capability limitations but from dialect-suppressing mechanisms introduced during post-training; few-shot prompting substantially mitigates this bias. We propose actionable prompt optimization strategies and an evaluation best-practice guide, offering both theoretical grounding and practical pathways toward equitable DA representation in LLMs.
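To make the few-shot mitigation concrete, the snippet below sketches how in-context dialectal exemplars can be prepended to a request so the model answers in a target dialect rather than defaulting to Modern Standard Arabic. The exemplar pairs and the `build_prompt` helper are hypothetical illustrations, not the paper's released prompts or data.

```python
# Minimal sketch of few-shot prompting to elicit dialectal output.
# The MSA -> Egyptian Arabic exemplars below are illustrative only.
FEW_SHOT_PAIRS = [
    ("كيف حالك اليوم؟", "عامل إيه النهارده؟"),
    ("لا أستطيع الحضور غداً.", "مش هقدر أجي بكرة."),
]

def build_prompt(msa_input: str, dialect: str = "Egyptian Arabic") -> str:
    """Assemble a few-shot prompt asking the model to respond in the target dialect.

    Zero-shot requests are often answered in MSA; prepending dialectal
    demonstrations nudges the model toward generating the requested variety.
    """
    lines = [f"Translate Modern Standard Arabic into {dialect}."]
    for msa, da in FEW_SHOT_PAIRS:
        lines.append(f"MSA: {msa}\n{dialect}: {da}")
    lines.append(f"MSA: {msa_input}\n{dialect}:")
    return "\n\n".join(lines)

if __name__ == "__main__":
    # The assembled string would be sent to whichever LLM is under evaluation.
    print(build_prompt("أين تقع أقرب محطة قطار؟"))
```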
📝 Abstract
Dialectal Arabic (DA) varieties are under-served by language technologies, particularly large language models (LLMs). This trend threatens to exacerbate existing social inequalities and to limit the reach of LLM applications, yet the research community lacks operationalized measurements of LLM performance in DA. We present a framework that comprehensively assesses LLMs' DA modeling capabilities across four dimensions: fidelity, understanding, quality, and diglossia. We evaluate nine LLMs across eight DA varieties and provide practical recommendations. Our evaluation suggests that LLMs do not produce DA as well as they understand it, not because their DA fluency is poor, but because they are reluctant to generate it. Further analysis suggests that current post-training can introduce bias against DA, that few-shot examples can overcome this deficiency, and that, beyond these factors, no measurable features of the input text correlate well with LLM DA performance.
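As a rough illustration of the four-dimension framing in the abstract, the skeleton below averages per-dimension scores over a benchmark. The dimension names come from the abstract; the dataclass fields, scoring functions, and `evaluate` harness are stand-in placeholders, not the paper's actual metrics or code.

```python
# Illustrative evaluation skeleton: fidelity, understanding, quality, diglossia.
# All scoring functions are placeholders under assumed interfaces.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Example:
    dialect: str    # e.g. "Egyptian", "Levantine", ...
    prompt: str     # input shown to the model
    reference: str  # human-curated reference output

def score_fidelity(output: str, ex: Example) -> float:
    """Placeholder: does the output stay in the requested dialect?"""
    return 0.0

def score_understanding(output: str, ex: Example) -> float:
    """Placeholder: does the model correctly interpret dialectal input?"""
    return 0.0

def score_quality(output: str, ex: Example) -> float:
    """Placeholder: fluency and adequacy of the generated dialectal text."""
    return 0.0

def score_diglossia(output: str, ex: Example) -> float:
    """Placeholder: can the model switch register between MSA and DA?"""
    return 0.0

DIMENSIONS: Dict[str, Callable[[str, Example], float]] = {
    "fidelity": score_fidelity,
    "understanding": score_understanding,
    "quality": score_quality,
    "diglossia": score_diglossia,
}

def evaluate(generate: Callable[[str], str], data: List[Example]) -> Dict[str, float]:
    """Average each dimension's score over the benchmark examples."""
    totals = {name: 0.0 for name in DIMENSIONS}
    for ex in data:
        output = generate(ex.prompt)
        for name, fn in DIMENSIONS.items():
            totals[name] += fn(output, ex)
    return {name: total / max(len(data), 1) for name, total in totals.items()}
```

A real harness would plug in dialect-identification, comprehension, quality, and register-transfer metrics for the placeholders and run one pass per model and dialect variety.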