🤖 AI Summary
Arabic large language model (LLM) evaluation suffers from a lack of systematic surveys, ambiguous benchmark categorization, and narrow evaluation dimensions: weak temporal awareness, insufficient multi-turn dialogue coverage, and cultural misalignment in translated datasets. Method: We propose the first four-dimensional taxonomy for Arabic LLM evaluation, systematically organizing 40+ benchmarks into Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations; identify pervasive cultural misalignment in translated data; and introduce a principled framework for the trade-offs among natively constructed, translated, and synthetically generated data. Through a comprehensive literature review, benchmark analysis, data-provenance tracing, and comparison of evaluation metrics, we establish a reproducible assessment methodology. Contribution/Results: This work delivers the first structured, culturally grounded evaluation framework for Arabic NLP, enabling temporally aware, culturally sensitive, and methodologically rigorous LLM assessment, thereby advancing equitable, context-aware Arabic language technology.
📝 Abstract
This survey provides the first systematic review of Arabic LLM benchmarks, analyzing more than 40 evaluation benchmarks spanning NLP tasks, knowledge domains, cultural understanding, and specialized capabilities. We propose a taxonomy organizing these benchmarks into four categories: Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations. Our analysis reveals significant progress in benchmark diversity while identifying critical gaps: limited temporal evaluation, insufficient multi-turn dialogue assessment, and cultural misalignment in translated datasets. We examine three primary approaches to benchmark construction (native collection, translation, and synthetic generation) and discuss their trade-offs regarding authenticity, scale, and cost. This work serves as a comprehensive reference for Arabic NLP researchers, providing insights into benchmark methodologies, reproducibility standards, and evaluation metrics, and offering recommendations for future development.
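To make the two organizing axes concrete, below is a minimal, hypothetical Python sketch of how a benchmark registry could tag each entry by taxonomy category and data provenance, the two dimensions the survey discusses. The `Benchmark` class, its field names, and the example entry are illustrative assumptions for this summary, not an artifact released with the survey.

```python
from dataclasses import dataclass

# The four taxonomy categories and three data-provenance approaches
# named in the survey; everything else here is illustrative.
CATEGORIES = {"Knowledge", "NLP Tasks", "Culture and Dialects", "Target-Specific"}
PROVENANCE = {"native", "translated", "synthetic"}

@dataclass
class Benchmark:
    """One registry entry, tagged along both axes."""
    name: str
    category: str     # one of CATEGORIES
    provenance: str   # one of PROVENANCE
    multi_turn: bool = False  # dialogue coverage: a gap the survey flags

    def __post_init__(self) -> None:
        # Validate tags so every entry stays within the taxonomy.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if self.provenance not in PROVENANCE:
            raise ValueError(f"unknown provenance: {self.provenance!r}")

if __name__ == "__main__":
    # "ExampleDialectQA" is a made-up placeholder, not a benchmark from the survey.
    entry = Benchmark(name="ExampleDialectQA",
                      category="Culture and Dialects",
                      provenance="native")
    print(entry)
```

Tagging provenance alongside category makes the survey's trade-off discussion operational: given such a registry, one could, for instance, filter translated benchmarks to audit them for the cultural-misalignment issues the authors identify.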