IslamicMMLU: A Benchmark for Evaluating LLMs on Islamic Knowledge

πŸ“… 2026-03-24
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This study addresses the lack of systematic evaluation of large language models (LLMs) on core Islamic disciplines by introducing IslamicMMLU, a benchmark comprising 10,013 multiple-choice questions spanning the Qur’an, Hadith, and Islamic jurisprudence (Fiqh). Notably, it incorporates a novel task to detect jurisprudential school bias, assessing LLMs’ tendencies across different Islamic intellectual traditions. The authors evaluate 26 prominent LLMs on this benchmark, revealing accuracy scores ranging from 39.8% to 93.8%, with Gemini 3 Flash achieving the highest performance. The results further indicate that specialized Arabic-language models generally underperform compared to state-of-the-art general-purpose models. An open leaderboard has been released to establish a new paradigm for evaluating religious knowledge comprehension and cultural sensitivity in artificial intelligence systems.
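The summary does not describe how the jurisprudential school bias task is scored; purely as an illustration of one plausible approach, the sketch below tallies which madhab a model's chosen options align with on Fiqh questions whose answer options map to different schools. The data layout, field names, and the function tally_madhab_preferences are assumptions for illustration, not the authors' released evaluation code.

from collections import Counter

def tally_madhab_preferences(items, model_answers):
    # items: list of dicts, each with an "id" and an "option_school" map such as
    # {"A": "Hanafi", "B": "Maliki", ...} (hypothetical layout);
    # model_answers: dict mapping question id -> the option letter the model chose.
    counts = Counter()
    for item in items:
        choice = model_answers.get(item["id"])
        school = item["option_school"].get(choice)
        if school is not None:
            counts[school] += 1
    total = sum(counts.values()) or 1
    # A strong skew away from a uniform distribution would suggest a
    # school-of-thought preference in the model's answers.
    return {school: n / total for school, n in counts.items()}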

πŸ“ Abstract
Large language models are increasingly consulted for Islamic knowledge, yet no comprehensive benchmark evaluates their performance across core Islamic disciplines. We introduce IslamicMMLU, a benchmark of 10,013 multiple-choice questions spanning three tracks: Quran (2,013 questions), Hadith (4,000 questions), and Fiqh (jurisprudence, 4,000 questions). Each track comprises multiple question types that probe LLMs' capabilities in handling different aspects of Islamic knowledge. The benchmark underpins the IslamicMMLU public leaderboard for evaluating LLMs; our initial evaluation covers 26 LLMs, whose accuracy averaged across the three tracks ranges from 39.8% to 93.8% (the latter achieved by Gemini 3 Flash). The Quran track shows the widest span (32.4% to 99.3%), while the Fiqh track includes a novel madhab (Islamic school of jurisprudence) bias detection task that reveals varying school-of-thought preferences across models. Arabic-specific models show mixed results, but all of them underperform relative to frontier models. The evaluation code and leaderboard are made publicly available.
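As a minimal sketch of the headline metric described in the abstract (per-track multiple-choice accuracy, then an average over the Quran, Hadith, and Fiqh tracks), the snippet below assumes a simple item layout with "id", "track", and "gold" fields; whether the reported average is unweighted across tracks or weighted by question count is not stated here, so the unweighted version is an assumption.

from collections import defaultdict

def track_accuracies(items, model_answers):
    # items: list of dicts with "id", "track" (Quran / Hadith / Fiqh), and a
    # "gold" option letter (hypothetical layout);
    # model_answers: dict mapping question id -> predicted option letter.
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        total[item["track"]] += 1
        if model_answers.get(item["id"]) == item["gold"]:
            correct[item["track"]] += 1
    return {track: correct[track] / total[track] for track in total}

def averaged_accuracy(items, model_answers):
    # Unweighted mean over tracks (an assumption; a question-weighted mean is also possible).
    per_track = track_accuracies(items, model_answers)
    return sum(per_track.values()) / len(per_track)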
Problem

Research questions and friction points this paper is trying to address.

Islamic knowledge
large language models
benchmark
Quran
Fiqh
Innovation

Methods, ideas, or system contributions that make the work stand out.

IslamicMMLU
benchmark
madhab bias detection
large language models
Islamic knowledge evaluation