AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models

📅 2025-11-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Arabic and bilingual large language models achieve high scores on superficial linguistic tasks (e.g., spelling, lexical knowledge) but show marked deficiencies in deep linguistic understanding, particularly in grammar, morphology, and syntax, revealing a misalignment between benchmark performance and actual linguistic competence. Method: We introduce AraLingBench, the first fully human-annotated diagnostic benchmark for Arabic, covering five dimensions: grammar, morphology, spelling, reading comprehension, and syntax. It comprises 150 expert-crafted multiple-choice items explicitly designed to dissociate surface-level knowledge from deep reasoning. We propose a multidimensional diagnostic evaluation framework and open-source all data and evaluation code. Contribution/Results: Systematic evaluation of 35 state-of-the-art models demonstrates pervasive reliance on memorization and pattern matching, with pronounced failures in syntactic parsing and structural reasoning. AraLingBench proves highly sensitive to genuine linguistic capability, establishing its validity as a diagnostic tool for uncovering latent model weaknesses.

📝 Abstract
We present AraLingBench: a fully human-annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories (grammar, morphology, spelling, reading comprehension, and syntax) through 150 expert-designed multiple-choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface-level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.
Problem

Research questions and friction points this paper is trying to address.

Evaluating Arabic linguistic capabilities of large language models
Assessing structural language understanding through grammar and syntax
Measuring gaps between memorization and true linguistic comprehension
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-annotated benchmark for Arabic linguistic evaluation
Tests five core categories through multiple-choice questions
Diagnostic framework isolating fundamental linguistic skills
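The diagnostic framework described above reduces to scoring multiple-choice predictions per linguistic category. A minimal sketch in Python, assuming a hypothetical item schema and toy data (the released data format and evaluation code on GitHub may differ):

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical item schema; the real AraLingBench layout may differ.
@dataclass
class MCQItem:
    category: str       # e.g. "grammar", "morphology", "spelling", ...
    question: str
    choices: list[str]
    answer: str         # gold choice label, e.g. "B"

def per_category_accuracy(items, predictions):
    """Score a model's chosen labels against gold answers, grouped by category."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.category] += 1
        if pred == item.answer:
            correct[item.category] += 1
    return {cat: correct[cat] / total[cat] for cat in total}

# Illustrative toy items, not real benchmark questions.
items = [
    MCQItem("grammar", "...", ["A", "B", "C", "D"], "B"),
    MCQItem("syntax", "...", ["A", "B", "C", "D"], "C"),
    MCQItem("syntax", "...", ["A", "B", "C", "D"], "A"),
]
predictions = ["B", "C", "D"]  # labels a model chose for each item
scores = per_category_accuracy(items, predictions)
```

Reporting accuracy per category rather than a single aggregate score is what lets the benchmark separate surface-level skills (e.g., spelling) from structural ones (e.g., syntax).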
Mohammad Zbib
King Abdullah University of Science and Technology (KAUST), American University of Beirut (AUB)
Hasan Abed Al Kader Hammoud
King Abdullah University of Science and Technology
Deep Learning · Computer Vision · Machine Learning
Sina Mukalled
American University of Beirut (AUB)
Nadine Rizk
American University of Beirut (AUB)
Fatima Karnib
American University of Beirut (AUB)
Issam Lakkis
American University of Beirut (AUB)
Ammar Mohanna
American University of Beirut (AUB)
Bernard Ghanem
Professor, King Abdullah University of Science and Technology
Computer Vision · Machine Learning