🤖 AI Summary
Arabic large language model (LLM) evaluation suffers from a lack of systematic surveys, ambiguous benchmark categorization, and narrow evaluation dimensions: weak temporal awareness, insufficient multi-turn dialogue coverage, and cultural misalignment in translated datasets. Method: We propose the first four-dimensional taxonomy for Arabic LLM evaluation, systematically organizing 40+ benchmarks into Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations; identify pervasive cultural misalignment in translated data; and introduce a principled framework for the trade-offs among natively constructed, translated, and synthetically generated data. Through a comprehensive literature review, benchmark analysis, data-provenance tracing, and comparison of evaluation metrics, we establish a reproducible assessment methodology. Contribution/Results: This work delivers the first structured, culturally grounded evaluation framework for Arabic NLP, enabling temporally aware, culturally sensitive, and methodologically rigorous LLM assessment, thereby advancing equitable, context-aware Arabic language technology.
📝 Abstract
This survey provides the first systematic review of Arabic LLM benchmarks, analyzing more than 40 evaluation benchmarks spanning NLP tasks, knowledge domains, cultural understanding, and specialized capabilities. We propose a taxonomy organizing these benchmarks into four categories: Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations. Our analysis reveals significant progress in benchmark diversity while identifying critical gaps: limited temporal evaluation, insufficient multi-turn dialogue assessment, and cultural misalignment in translated datasets. We examine three primary approaches to benchmark construction (native collection, translation, and synthetic generation) and discuss their trade-offs regarding authenticity, scale, and cost. This work serves as a comprehensive reference for Arabic NLP researchers, providing insights into benchmark methodologies, reproducibility standards, and evaluation metrics, and offering recommendations for future development.
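To make the two organizing axes concrete, below is a minimal, hypothetical Python sketch of how a benchmark registry could tag each entry by taxonomy category and data provenance, the two dimensions the survey discusses. The `Benchmark` class, its field names, and the example entry are illustrative assumptions for this summary, not an artifact released with the survey.

```python
from dataclasses import dataclass

# The four taxonomy categories and three data-provenance approaches
# named in the survey; everything else here is illustrative.
CATEGORIES = {"Knowledge", "NLP Tasks", "Culture and Dialects", "Target-Specific"}
PROVENANCE = {"native", "translated", "synthetic"}

@dataclass
class Benchmark:
    """One registry entry, tagged along both axes."""
    name: str
    category: str     # one of CATEGORIES
    provenance: str   # one of PROVENANCE
    multi_turn: bool = False  # dialogue coverage: a gap the survey flags

    def __post_init__(self) -> None:
        # Validate tags so every entry stays within the taxonomy.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if self.provenance not in PROVENANCE:
            raise ValueError(f"unknown provenance: {self.provenance!r}")

if __name__ == "__main__":
    # "ExampleDialectQA" is a made-up placeholder, not a benchmark from the survey.
    entry = Benchmark(name="ExampleDialectQA",
                      category="Culture and Dialects",
                      provenance="native")
    print(entry)
```

Tagging provenance alongside category makes the survey's trade-off discussion operational: given such a registry, one could, for instance, filter translated benchmarks to audit them for the cultural-misalignment issues the authors identify.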