Are Arabic Benchmarks Reliable? QIMMA's Quality-First Approach to LLM Evaluation

📅 2026-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of systematic quality validation in existing Arabic evaluation benchmarks for large language models (LLMs), which undermines the reliability of assessment results. To remedy this, we propose QIMMA, a novel evaluation framework that prioritizes data quality by introducing a rigorous preprocessing pipeline combining multi-model automatic filtering with human review to systematically clean and correct prominent Arabic benchmarks. Built upon LightEval and EvalPlus, our approach establishes a transparent evaluation infrastructure with fully disclosed model outputs to ensure reproducibility. The resulting benchmark comprises 52,000 high-quality samples spanning multiple domains and tasks; all content is native Arabic except the code-related tasks, which are inherently language-agnostic. It provides a robust, reproducible foundation for advancing Arabic NLP research.
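The filtering stage described above can be pictured as a judge-agreement router: several LLM judges flag each benchmark sample, and samples flagged by a majority are escalated to human review rather than silently dropped. The following is a minimal sketch of that idea; the class names, threshold, and judge outputs are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of multi-model filtering with human-review routing.
# Each sample carries one boolean flag per LLM judge; a majority of flags
# sends the sample to human review instead of the cleaned benchmark.
from dataclasses import dataclass, field


@dataclass
class Sample:
    text: str
    judge_flags: list = field(default_factory=list)  # one bool per judge model


def route(samples, majority=0.5):
    """Split samples into (kept, needs_human_review) by judge agreement."""
    kept, review = [], []
    for s in samples:
        flagged_ratio = sum(s.judge_flags) / max(len(s.judge_flags), 1)
        (review if flagged_ratio > majority else kept).append(s)
    return kept, review


# Example with three judges: the second sample is flagged by 2 of 3 judges,
# so it is routed to human review; the first stays in the benchmark.
samples = [
    Sample("سؤال سليم", [False, False, True]),
    Sample("سؤال مكرر", [True, True, False]),
]
kept, review = route(samples)
```

The design choice sketched here (escalation instead of automatic deletion) matches the paper's stated combination of automated LLM judgment with human review; the exact voting rule used by QIMMA is not specified in this summary.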
📝 Abstract
We present QIMMA, a quality-assured Arabic LLM leaderboard that places systematic benchmark validation at its core. Rather than aggregating existing resources as-is, QIMMA applies a multi-model assessment pipeline combining automated LLM judgment with human review to surface and resolve systematic quality issues in well-established Arabic benchmarks before evaluation. The result is a curated, multi-domain, multi-task evaluation suite of over 52k samples, grounded predominantly in native Arabic content; code evaluation tasks are the sole exception, as they are inherently language-agnostic. Transparent implementation via LightEval and EvalPlus, together with the public release of per-sample inference outputs, makes QIMMA a reproducible and community-extensible foundation for Arabic NLP evaluation.
Problem

Research questions and friction points this paper is trying to address.

Arabic benchmarks
LLM evaluation
benchmark reliability
quality assurance
systematic quality issues
Innovation

Methods, ideas, or system contributions that make the work stand out.

quality-assured evaluation
multi-model assessment
systematic benchmark validation
Arabic LLM benchmarking
reproducible NLP evaluation
Leen AlQadi
Technology Innovation Institute, Abu Dhabi, UAE
Ahmed Alzubaidi
Technology Innovation Institute, Abu Dhabi, UAE
Mohammed Alyafeai
Technology Innovation Institute, Abu Dhabi, UAE
Hamza Alobeidli
Technology Innovation Institute, Abu Dhabi, UAE
Maitha Alhammadi
Technology Innovation Institute, Abu Dhabi, UAE
Shaikha Alsuwaidi
Technology Innovation Institute, Abu Dhabi, UAE
Omar Alkaabi
Technology Innovation Institute, Abu Dhabi, UAE
Basma El Amel Boussaha
Lead Researcher @ tii.ae | PhD, Université de Nantes
Natural Language Processing · Large Language Models · Arabic NLP · Deep Learning
Hakim Hacid
Technology Innovation Institute (TII), UAE
Machine Learning · LLM · Databases · Information Retrieval · Edge ML