🤖 AI Summary
This work addresses the longstanding scarcity of high-quality open-source datasets and evaluation benchmarks for medical natural language processing in Arabic, a gap that has hindered the development of multilingual large language models. To bridge it, the authors introduce MedAraBench, the first large-scale Arabic medical question-answering benchmark, spanning 19 medical specialties and five difficulty levels. Data quality is ensured through manual digitization of professional medical materials, expert annotation, and a dual validation mechanism that combines "LLM-as-a-Judge" scoring with human review. The dataset and evaluation scripts are publicly released, and comprehensive evaluations of eight leading models, including GPT-5, Gemini 2.0 Flash, and Claude 4-Sonnet, reveal critical performance bottlenecks on Arabic medical tasks, filling a crucial void in healthcare AI evaluation for the Arabic language.
📝 Abstract
Arabic remains one of the most underrepresented languages in natural language processing research, particularly in medical applications, due to the limited availability of open-source data and benchmarks. This scarcity of resources hinders efforts to evaluate and advance the multilingual capabilities of Large Language Models (LLMs). In this paper, we introduce MedAraBench, a large-scale dataset of Arabic multiple-choice question-answer pairs across various medical specialties. We constructed the dataset by manually digitizing a large repository of academic materials created by medical professionals in the Arabic-speaking world. We then conducted extensive preprocessing and split the dataset into training and test sets to support future research in the area. To assess data quality, we adopted two complementary frameworks: expert human evaluation and LLM-as-a-judge. The resulting dataset is diverse and of high quality, spanning 19 specialties and five difficulty levels. For benchmarking, we assessed the performance of eight state-of-the-art open-source and proprietary models, including GPT-5, Gemini 2.0 Flash, and Claude 4-Sonnet. Our findings reveal critical performance bottlenecks in Arabic medical question answering and highlight the need for further domain-specific enhancements. We release the dataset and evaluation scripts to broaden the diversity of medical data benchmarks, expand the scope of evaluation suites for LLMs, and enhance the multilingual capabilities of models for deployment in clinical settings.
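For readers curious how such a benchmark is typically scored, below is a minimal sketch of a multiple-choice evaluation loop of the kind the released scripts might implement. The test-file format, the `answer`/`options` field names, and the `query_model` callable are assumptions for illustration, not the paper's actual interface:

```python
import json
import re

def extract_choice(response: str) -> str | None:
    """Pull the first standalone option letter (A-E) out of a model's raw response."""
    match = re.search(r"\b([A-E])\b", response.strip().upper())
    return match.group(1) if match else None

def evaluate(test_path: str, query_model) -> float:
    """Compute multiple-choice accuracy for one model over a JSONL test split.

    `query_model` is assumed to be a callable taking a prompt string and
    returning the model's text response (e.g., a thin wrapper around an API).
    """
    correct = total = 0
    with open(test_path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)  # assumed fields: question, options, answer
            options = "\n".join(f"{k}. {v}" for k, v in item["options"].items())
            prompt = (
                "Answer the following Arabic medical question by replying "
                f"with a single option letter.\n\n{item['question']}\n{options}"
            )
            prediction = extract_choice(query_model(prompt))
            correct += prediction == item["answer"]
            total += 1
    return correct / total
```

Accuracy computed this way per specialty and per difficulty level would yield the kind of breakdown the benchmark reports across its 19 specialties and five levels.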