🤖 AI Summary
This study addresses the lack of systematic quality validation in existing evaluation benchmarks for large language models (LLMs) in Arabic, a gap that undermines the reliability of assessment results. To remedy this, we propose QIMMA, a novel evaluation framework that prioritizes data quality by introducing a rigorous preprocessing pipeline that combines multi-model automatic filtering with human review to systematically cleanse and correct prominent Arabic benchmarks. Built upon LightEval and EvalPlus, our approach establishes a transparent evaluation infrastructure with fully disclosed per-sample model outputs to ensure reproducibility. The resulting benchmark comprises over 52,000 high-quality samples spanning multiple domains and tasks, predominantly native Arabic content, with code-related tasks as the only language-agnostic exception, and provides a robust, reproducible foundation for advancing Arabic NLP research.
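The multi-model filtering stage can be pictured as several independent automatic judges voting on each benchmark item, with flagged items escalated to human reviewers. The sketch below is illustrative only and not taken from the paper: the `Judge` interface, the `flag_threshold` parameter, and the `QualityVerdict` structure are assumptions about how such a triage step might be organized.

```python
from dataclasses import dataclass
from typing import Callable

# A judge is any callable that inspects a benchmark item and returns True
# if it considers the item flawed (e.g., a wrong gold answer or garbled text).
# In practice each judge would wrap a prompted LLM; here it is left abstract.
Judge = Callable[[dict], bool]

@dataclass
class QualityVerdict:
    item_id: str
    flagged_by: list[str]       # names of the judges that flagged the item
    needs_human_review: bool    # escalate once enough judges agree it is flawed

def triage_items(items: list[dict], judges: dict[str, Judge],
                 flag_threshold: int = 1) -> list[QualityVerdict]:
    """Run every automatic judge on every item and mark items for human review.

    An item is escalated as soon as `flag_threshold` judges flag it;
    items passed by all judges are kept as-is.
    """
    verdicts = []
    for item in items:
        flagged_by = [name for name, judge in judges.items() if judge(item)]
        verdicts.append(QualityVerdict(
            item_id=item["id"],
            flagged_by=flagged_by,
            needs_human_review=len(flagged_by) >= flag_threshold,
        ))
    return verdicts
```

The escalation threshold and voting scheme are design choices not specified in the summary; the point is only that automatic filtering narrows the set of items that human annotators must correct or discard.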
📝 Abstract
We present QIMMA, a quality-assured Arabic LLM leaderboard that places systematic benchmark validation at its core. Rather than aggregating existing resources as-is, QIMMA applies a multi-model assessment pipeline combining automated LLM judgment with human review to surface and resolve systematic quality issues in well-established Arabic benchmarks before evaluation. The result is a curated, multi-domain, multi-task evaluation suite of over 52k samples, grounded predominantly in native Arabic content; code evaluation tasks are the sole exception, as they are inherently language-agnostic. Transparent implementation via LightEval and EvalPlus, together with the public release of per-sample inference outputs, makes QIMMA a reproducible and community-extensible foundation for Arabic NLP evaluation.