Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps

📅 2025-10-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Arabic large language model (LLM) evaluation suffers from a lack of systematic surveys, ambiguous benchmark categorization, and narrow evaluation dimensions: weak temporal awareness, insufficient multi-turn dialogue coverage, and cultural misalignment in translated datasets. Method: We propose the first four-dimensional taxonomy for Arabic LLM evaluation, systematically organizing 40+ benchmarks into four categories (knowledge, NLP tasks, culture and dialects, and target-specific capabilities); identify pervasive cultural misalignment in translated data; and introduce a principled framework for the trade-offs among natively collected, translated, and synthetically generated data. Through comprehensive literature review, benchmark analysis, data-provenance tracing, and evaluation-metric comparison, we establish a reproducible assessment methodology. Contribution/Results: This work delivers the first structured, culturally grounded evaluation framework for Arabic NLP, enabling temporally aware, culturally sensitive, and methodologically rigorous LLM assessment, thereby advancing equitable and context-aware Arabic language technology.
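As a rough illustration of the trade-off framework described above, the sketch below encodes the three benchmark-creation approaches alongside the qualitative dimensions the paper names (authenticity, scale, cost). The specific ratings are placeholder assumptions chosen for illustration, not values reported by the survey.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CreationApproach:
    """One benchmark-creation strategy and its qualitative trade-offs."""
    name: str
    authenticity: str  # cultural and linguistic fidelity of the resulting data
    scale: str         # how easily the dataset can grow
    cost: str          # human effort and money required

# Placeholder ratings: assumptions for illustration only, not from the paper.
APPROACHES = [
    CreationApproach("native collection", authenticity="high", scale="low", cost="high"),
    CreationApproach("translation", authenticity="low-to-medium", scale="high", cost="medium"),
    CreationApproach("synthetic generation", authenticity="medium", scale="high", cost="low"),
]

# Print a small comparison table of the three approaches.
for a in APPROACHES:
    print(f"{a.name:21s} authenticity={a.authenticity:13s} scale={a.scale:5s} cost={a.cost}")
```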

📝 Abstract
This survey provides the first systematic review of Arabic LLM benchmarks, analyzing 40+ evaluation benchmarks across NLP tasks, knowledge domains, cultural understanding, and specialized capabilities. We propose a taxonomy organizing benchmarks into four categories: Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations. Our analysis reveals significant progress in benchmark diversity while identifying critical gaps: limited temporal evaluation, insufficient multi-turn dialogue assessment, and cultural misalignment in translated datasets. We examine three primary approaches to benchmark creation (native collection, translation, and synthetic generation), discussing their trade-offs in authenticity, scale, and cost. This work serves as a comprehensive reference for Arabic NLP researchers, providing insights into benchmark methodologies, reproducibility standards, and evaluation metrics, and offering recommendations for future development.
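To make the four-way taxonomy concrete, here is a minimal sketch of a benchmark registry organized by the survey's categories. The category names are taken from the abstract; the benchmark entries (example-exam-qa and friends) are hypothetical placeholders, not the paper's actual assignments.

```python
from enum import Enum

class Category(Enum):
    """The survey's four benchmark categories (names from the abstract)."""
    KNOWLEDGE = "Knowledge"
    NLP_TASKS = "NLP Tasks"
    CULTURE_AND_DIALECTS = "Culture and Dialects"
    TARGET_SPECIFIC = "Target-Specific"

# Hypothetical placeholder entries; the paper itself catalogues 40+ real benchmarks.
REGISTRY = {
    "example-exam-qa": Category.KNOWLEDGE,
    "example-ner-suite": Category.NLP_TASKS,
    "example-dialect-id": Category.CULTURE_AND_DIALECTS,
    "example-medical-qa": Category.TARGET_SPECIFIC,
}

def by_category(cat):
    """List all registered benchmark names under one category."""
    return [name for name, c in REGISTRY.items() if c is cat]

for cat in Category:
    print(f"{cat.value}: {by_category(cat)}")
```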
Problem

Research questions and friction points this paper addresses.

Systematically reviewing Arabic LLM benchmarks across diverse evaluation categories
Identifying critical gaps in temporal evaluation and cultural alignment
Analyzing benchmark creation methods and their trade-offs for Arabic NLP
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposed taxonomy organizing benchmarks into four categories
Examined native collection, translation, and synthetic generation approaches
Identified critical gaps in temporal and dialogue evaluation
Authors
Ahmed Alzubaidi
Technology Innovation Institute, Abu Dhabi, UAE
Shaikha Alsuwaidi
Technology Innovation Institute, Abu Dhabi, UAE
Basma El Amel Boussaha
Lead Researcher @ tii.ae | PhD, Université de Nantes
Natural Language Processing · Large Language Models · Arabic NLP · Deep Learning
Leen AlQadi
Technology Innovation Institute, Abu Dhabi, UAE
Omar Alkaabi
Technology Innovation Institute, Abu Dhabi, UAE
Mohammed Alyafeai
Technology Innovation Institute, Abu Dhabi, UAE
Hamza Alobeidli
Technology Innovation Institute, Abu Dhabi, UAE
Hakim Hacid
Technology Innovation Institute (TII), UAE
Machine Learning · LLM · Databases · Information Retrieval · Edge ML