AI Summary
Existing Arabic large language model (LLM) benchmarks are heavily skewed toward linguistic, cultural, and religious domains, lacking dedicated, high-quality evaluation resources for critical applied areas such as STEM and code generation.
Method: This work presents the first systematic construction of three rigorously curated Arabic benchmarks: (1) a STEM question-answering dataset derived from locally adopted Arabic textbooks; (2) a generative STEM problem set combining synthetic generation with expert human validation; and (3) an Arabic code-generation benchmark produced through multi-round expert translation and functional verification. All three datasets are curated for linguistic accuracy, cultural appropriateness, and technical rigor, and are publicly released.
Contribution/Results: Experiments demonstrate that these benchmarks enable substantially higher-fidelity evaluation of Arabic LLMs on scientific reasoning and programming tasks. They fill a critical gap in non-English STEM assessment infrastructure and establish a reproducible, cross-domain evaluation paradigm for fair multilingual LLM benchmarking.
Abstract
Arabic is one of the most widely spoken languages in the world, yet efforts to develop and evaluate Large Language Models (LLMs) for Arabic remain relatively limited. Most existing Arabic benchmarks focus on linguistic, cultural, or religious content, leaving a significant gap in domains like STEM and code, which are increasingly relevant for real-world LLM applications. To help bridge this gap, we present 3LM, a suite of three benchmarks designed specifically for Arabic. The first is a set of STEM-related question-answer pairs, naturally sourced from Arabic textbooks and educational worksheets. The second consists of synthetically generated STEM questions, created using the same sources. The third benchmark focuses on code generation, built through a careful translation of two widely used code benchmarks, incorporating a human-in-the-loop process with several rounds of review to ensure high-quality and faithful translations. We release all three benchmarks publicly to support the growth of Arabic LLM research in these essential but underrepresented areas.
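To make the phrase "functional verification" concrete, the sketch below shows one common way a code-generation problem can be scored: the model's completion is executed and counted as a pass only if every reference assertion holds. This is a minimal illustration under simplifying assumptions (the function name `run_functional_check`, the assert-based test format, and the absence of sandboxing or timeouts are all hypothetical); it does not describe the 3LM evaluation harness itself.

```python
def run_functional_check(candidate_code: str, test_code: str) -> bool:
    """Return True if the candidate solution passes all assert-based tests.

    NOTE: a real harness would sandbox execution and enforce a timeout;
    this sketch only illustrates the pass/fail logic.
    """
    namespace = {}
    try:
        exec(candidate_code, namespace)  # define the model-generated function
        exec(test_code, namespace)       # run the reference unit tests (asserts)
        return True
    except Exception:
        return False


# Toy usage with a hypothetical problem: a model completion plus reference tests.
candidate = "def add_two(a, b):\n    return a + b\n"
tests = "assert add_two(2, 3) == 5\nassert add_two(-1, 1) == 0\n"
print(run_functional_check(candidate, tests))  # -> True
```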