BALSAM: A Platform for Benchmarking Arabic Large Language Models

📅 2025-07-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Arabic large language models (LLMs) significantly underperform their English counterparts due to scarce training data, high dialectal and morphological complexity, and the absence of high-quality, comprehensive evaluation benchmarks. Existing Arabic evaluation datasets rely predominantly on static, publicly available resources, suffer from narrow task coverage, and lack blind-test protocols, leading to data contamination and biased assessments. To address these limitations, the authors introduce BALSAM, a community-driven, unified evaluation platform for Arabic LLMs. BALSAM encompasses 78 NLP tasks across 14 categories, with 52K total examples (37K blind-test and 15K development instances), and features dynamic dataset updates, strict data splits, and an open, transparent online evaluation framework. By mitigating data leakage and enforcing methodological rigor, BALSAM improves evaluation fairness, cross-model comparability, and scalability, aiming to set a standard for Arabic LLM assessment.

📝 Abstract
The impressive advancement of Large Language Models (LLMs) in English has not been matched across all languages. In particular, LLM performance in Arabic lags behind, due to data scarcity, the linguistic diversity of Arabic and its dialects, and morphological complexity. Progress is further hindered by the quality of Arabic benchmarks, which typically rely on static, publicly available data, lack comprehensive task coverage, or do not provide dedicated platforms with blind test sets. This makes it challenging to measure actual progress and to mitigate data contamination. Here, we aim to bridge these gaps. In particular, we introduce BALSAM, a comprehensive, community-driven benchmark aimed at advancing Arabic LLM development and evaluation. It includes 78 NLP tasks from 14 broad categories, with 52K examples divided into 37K test and 15K development, and a centralized, transparent platform for blind evaluation. We envision BALSAM as a unifying platform that sets standards and promotes collaborative research to advance Arabic LLM capabilities.
Problem

Research questions and friction points this paper is trying to address.

Addressing the performance gap between Arabic and English LLMs
Overcoming data scarcity and the linguistic diversity of Arabic and its dialects
Improving Arabic benchmarks with comprehensive task coverage and blind evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Community-driven benchmark platform for Arabic LLMs
78 NLP tasks across 14 broad categories, with 52K examples
Centralized, transparent platform for blind evaluation