BALSAM: A Platform for Benchmarking Arabic Large Language Models

📅 2025-07-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Arabic large language models (LLMs) significantly underperform their English counterparts due to scarce training data, high dialectal and morphological complexity, and the absence of high-quality, comprehensive evaluation benchmarks. Existing Arabic evaluation datasets rely predominantly on static, publicly available resources, suffer from narrow task coverage, and lack blind-test protocols, leading to data contamination and biased assessments. To address these limitations, the authors introduce BALSAM, a community-driven, unified evaluation platform for Arabic LLMs. BALSAM encompasses 78 NLP tasks across 14 categories, with 52K total examples (37K blind-test and 15K development instances), and features dynamic dataset updates, strict data splits, and an open, transparent online evaluation framework. By mitigating data leakage and enforcing methodological rigor, BALSAM improves evaluation fairness, cross-model comparability, and scalability, aiming to set a standard for Arabic LLM assessment.

📝 Abstract
The impressive advancement of Large Language Models (LLMs) in English has not been matched across all languages. In particular, LLM performance in Arabic lags behind, due to data scarcity, the linguistic diversity of Arabic and its dialects, and morphological complexity. Progress is further hindered by the quality of Arabic benchmarks, which typically rely on static, publicly available data, lack comprehensive task coverage, or do not provide dedicated platforms with blind test sets. This makes it challenging to measure actual progress and to mitigate data contamination. Here, we aim to bridge these gaps. In particular, we introduce BALSAM, a comprehensive, community-driven benchmark aimed at advancing Arabic LLM development and evaluation. It includes 78 NLP tasks from 14 broad categories, with 52K examples divided into 37K test and 15K development, and a centralized, transparent platform for blind evaluation. We envision BALSAM as a unifying platform that sets standards and promotes collaborative research to advance Arabic LLM capabilities.
Problem

Research questions and friction points this paper is trying to address.

Addressing the performance gap between Arabic and English LLMs
Overcoming data scarcity and the linguistic diversity of Arabic and its dialects
Improving Arabic benchmarks with comprehensive task coverage and blind evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Community-driven benchmark platform for Arabic LLMs
78 NLP tasks across 14 broad categories, with 52K examples
Centralized, transparent platform for blind evaluation