MedAraBench: Large-Scale Arabic Medical Question Answering Dataset and Benchmark

📅 2026-02-02
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work addresses the longstanding scarcity of high-quality open-source datasets and evaluation benchmarks for Arabic in medical natural language processing, which has hindered the development of multilingual large language models. To bridge this gap, the authors introduce MedAraBench, the first large-scale Arabic medical question-answering benchmark spanning 19 medical specialties and 5 difficulty levels. Data quality is ensured through manual digitization of professional medical materials, expert annotations, and a dual validation mechanism combining “LLM-as-a-Judge” with human review. The dataset and evaluation scripts are publicly released, and comprehensive evaluations across eight leading models—including GPT-5, Gemini 2.0 Flash, and Claude 4-Sonnet—reveal critical performance bottlenecks in Arabic medical tasks, thereby filling a crucial void in healthcare AI evaluation for the Arabic language.

📝 Abstract
Arabic remains one of the most underrepresented languages in natural language processing research, particularly in medical applications, due to the limited availability of open-source data and benchmarks. The lack of resources hinders efforts to evaluate and advance the multilingual capabilities of Large Language Models (LLMs). In this paper, we introduce MedAraBench, a large-scale dataset consisting of Arabic multiple-choice question-answer pairs across various medical specialties. We constructed the dataset by manually digitizing a large repository of academic materials created by medical professionals in the Arabic-speaking region. We then conducted extensive preprocessing and split the dataset into training and test sets to support future research efforts in the area. To assess the quality of the data, we adopted two frameworks, namely expert human evaluation and LLM-as-a-judge. Our dataset is diverse and of high quality, spanning 19 specialties and five difficulty levels. For benchmarking purposes, we assessed the performance of eight state-of-the-art open-source and proprietary models, such as GPT-5, Gemini 2.0 Flash, and Claude 4-Sonnet. Our findings highlight the need for further domain-specific enhancements. We release the dataset and evaluation scripts to broaden the diversity of medical data benchmarks, expand the scope of evaluation suites for LLMs, and enhance the multilingual capabilities of models for deployment in clinical settings.
Problem

Research questions and friction points this paper is trying to address.

Arabic, medical question answering, large language models, benchmark, multilingual NLP
Innovation

Methods, ideas, or system contributions that make the work stand out.

Arabic medical QA, large-scale dataset, multilingual LLM benchmarking, LLM-as-a-judge, medical domain adaptation
👥 Authors
Mouath Abu-Daoud, Engineering Division, New York University Abu Dhabi, UAE
Leen Kharouf, Engineering Division, New York University Abu Dhabi, UAE
Omar El Hajj, Engineering Division, New York University Abu Dhabi, UAE
Dana El Samad, Engineering Division, New York University Abu Dhabi, UAE
Mariam Al-Omari, Engineering Division, New York University Abu Dhabi, UAE
Jihad Mallat, Cleveland Clinic Abu Dhabi, UAE
Khaled Saleh, Cleveland Clinic Abu Dhabi, UAE
Nizar Habash, Professor of Computer Science, New York University Abu Dhabi (Natural Language Processing, Computational Linguistics, Artificial Intelligence)
Farah E. Shamout, Engineering Division, New York University Abu Dhabi, UAE