🤖 AI Summary
This study addresses the lack of systematic quality validation in existing evaluation benchmarks for large language models (LLMs) in Arabic, a gap that undermines the reliability of assessment results. To remedy this, we propose QIMMA, a novel evaluation framework that prioritizes data quality by introducing a rigorous preprocessing pipeline that combines multi-model automatic filtering with human review to systematically cleanse and correct prominent Arabic benchmarks. Built upon LightEval and EvalPlus, our approach establishes a transparent evaluation infrastructure with fully disclosed per-sample model outputs to ensure reproducibility. The resulting benchmark comprises over 52,000 high-quality samples spanning multiple domains and tasks, predominantly native Arabic content, with code-related tasks as the only language-agnostic exception, and provides a robust, reproducible foundation for advancing Arabic NLP research.
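The multi-model filtering stage can be pictured as several independent automatic judges voting on each benchmark item, with flagged items escalated to human reviewers. The sketch below is illustrative only and not taken from the paper: the `Judge` interface, the `flag_threshold` parameter, and the `QualityVerdict` structure are assumptions about how such a triage step might be organized.

```python
from dataclasses import dataclass
from typing import Callable

# A judge is any callable that inspects a benchmark item and returns True
# if it considers the item flawed (e.g., a wrong gold answer or garbled text).
# In practice each judge would wrap a prompted LLM; here it is left abstract.
Judge = Callable[[dict], bool]

@dataclass
class QualityVerdict:
    item_id: str
    flagged_by: list[str]       # names of the judges that flagged the item
    needs_human_review: bool    # escalate once enough judges agree it is flawed

def triage_items(items: list[dict], judges: dict[str, Judge],
                 flag_threshold: int = 1) -> list[QualityVerdict]:
    """Run every automatic judge on every item and mark items for human review.

    An item is escalated as soon as `flag_threshold` judges flag it;
    items passed by all judges are kept as-is.
    """
    verdicts = []
    for item in items:
        flagged_by = [name for name, judge in judges.items() if judge(item)]
        verdicts.append(QualityVerdict(
            item_id=item["id"],
            flagged_by=flagged_by,
            needs_human_review=len(flagged_by) >= flag_threshold,
        ))
    return verdicts
```

The escalation threshold and voting scheme are design choices not specified in the summary; the point is only that automatic filtering narrows the set of items that human annotators must correct or discard.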
📝 Abstract
We present QIMMA, a quality-assured Arabic LLM leaderboard that places systematic benchmark validation at its core. Rather than aggregating existing resources as-is, QIMMA applies a multi-model assessment pipeline combining automated LLM judgment with human review to surface and resolve systematic quality issues in well-established Arabic benchmarks before evaluation. The result is a curated, multi-domain, multi-task evaluation suite of over 52k samples, grounded predominantly in native Arabic content; code evaluation tasks are the sole exception, as they are inherently language-agnostic. Transparent implementation via LightEval and EvalPlus, together with the public release of per-sample inference outputs, makes QIMMA a reproducible and community-extensible foundation for Arabic NLP evaluation.