3LM: Bridging Arabic, STEM, and Code through Benchmarking

πŸ“… 2025-07-21
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Problem: Existing Arabic large language model (LLM) benchmarks are heavily skewed toward linguistic, cultural, and religious domains, lacking dedicated, high-quality evaluation resources for critical applied areas such as STEM and code generation.
Method: This work introduces the first systematic construction of three rigorously curated Arabic benchmarks: (1) a STEM question-answering dataset derived from locally adopted Arabic textbooks; (2) a generative STEM problem set combining synthetic generation with expert human validation; and (3) an Arabic code-generation benchmark produced via multi-round expert collaborative translation and functional verification. All datasets are publicly released and curated for linguistic accuracy, cultural appropriateness, and technical rigor.
Contribution/Results: Experiments demonstrate that these benchmarks substantially improve the evaluation fidelity of Arabic LLMs on scientific reasoning and programming tasks, filling a critical gap in non-English STEM assessment infrastructure and establishing a reproducible, cross-domain, multilingual paradigm for fair LLM benchmarking.

πŸ“ Abstract
Arabic is one of the most widely spoken languages in the world, yet efforts to develop and evaluate Large Language Models (LLMs) for Arabic remain relatively limited. Most existing Arabic benchmarks focus on linguistic, cultural, or religious content, leaving a significant gap in domains like STEM and code, which are increasingly relevant for real-world LLM applications. To help bridge this gap, we present 3LM, a suite of three benchmarks designed specifically for Arabic. The first is a set of STEM-related question-answer pairs, naturally sourced from Arabic textbooks and educational worksheets. The second consists of synthetically generated STEM questions, created using the same sources. The third benchmark focuses on code generation, built through a careful translation of two widely used code benchmarks, incorporating a human-in-the-loop process with several rounds of review to ensure high-quality and faithful translations. We release all three benchmarks publicly to support the growth of Arabic LLM research in these essential but underrepresented areas.
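The summary mentions "functional verification" of the translated code benchmark. The paper does not show its harness, but the general idea can be sketched as follows: execute a model's candidate solution in an isolated namespace, then run the benchmark's reference assertions against it. Function names and strings here are illustrative, not taken from the paper.

```python
# Hedged sketch of functional verification for a code-generation
# benchmark: a candidate passes only if all reference tests succeed.
# This is an assumed harness shape, not the paper's actual tooling.

def check_candidate(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate source passes the benchmark tests."""
    env: dict = {}
    try:
        exec(candidate_src, env)  # define the candidate function(s)
        exec(test_src, env)       # run the benchmark's assertions
        return True
    except Exception:
        return False

# Example task: the prompt may be in Arabic while the function
# signature stays in English, so the same tests remain valid.
candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(1, 2) == 3\nassert add(-1, 1) == 0\n"
print(check_candidate(candidate, tests))  # True
```

A real harness would additionally sandbox execution and enforce timeouts; the point here is only that translated prompts can reuse the original language-agnostic unit tests for verification.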
Problem

Research questions and friction points this paper is trying to address.

Lack of Arabic LLM benchmarks in STEM domains
Limited availability of Arabic code-generation benchmarks
Need for high-quality evaluation of Arabic educational content
Innovation

Methods, ideas, or system contributions that make the work stand out.

STEM question-answer pairs from Arabic textbooks
Synthetic STEM questions from educational sources
Translated code benchmarks with human review
Basma El Amel Boussaha
Lead Researcher @ tii.ae | PhD UniversitΓ© de Nantes
Natural Language Processing Β· Large Language Models Β· Arabic NLP Β· Deep Learning

Leen AlQadi
Technology Innovation Institute, Abu Dhabi, UAE

Mugariya Farooq
Mohamed Bin Zayed University of Artificial Intelligence, Technology Innovation Institute
Genomics Β· Machine Learning Β· Bio-informatics Β· Natural Language Processing

Shaikha Alsuwaidi
Technology Innovation Institute, Abu Dhabi, UAE

Giulia Campesan
Technology Innovation Institute, Abu Dhabi, UAE

Ahmed Alzubaidi
Technology Innovation Institute, Abu Dhabi, UAE

Mohammed Alyafeai
Technology Innovation Institute, Abu Dhabi, UAE

Hakim Hacid
Technology Innovation Institute (TII), UAE
Machine Learning Β· LLM Β· Databases Β· Information Retrieval Β· Edge ML