MizanQA: Benchmarking Large Language Models on Moroccan Legal Question Answering

📅 2025-08-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multilingual and Arabic large language models exhibit suboptimal performance on legal question answering in low-resource languages, particularly in culturally and jurisprudentially nuanced domains such as Morocco's hybrid legal system, which blends Modern Standard Arabic, Maliki jurisprudence, customary law, and French civil law influences. Method: We introduce MizanQA, the first multiple-choice benchmark tailored to this context, comprising over 1,700 single- and multi-answer questions. Its evaluation design spans multiple legal traditions, languages (Arabic/French), and answer formats, enabling assessment of culturally embedded legal reasoning. Contribution/Results: Experiments reveal significant performance gaps for state-of-the-art multilingual and Arabic LLMs, underscoring the need for domain adaptation and culturally grounded modeling. MizanQA establishes a rigorous, reproducible evaluation platform for Arabic legal AI and catalyzes the development of specialized models for Islamic and hybrid legal systems.

📝 Abstract
The rapid advancement of large language models (LLMs) has significantly propelled progress in natural language processing (NLP). However, their effectiveness in specialized, low-resource domains, such as Arabic legal contexts, remains limited. This paper introduces MizanQA (pronounced Mizan, meaning "scale" in Arabic, a universal symbol of justice), a benchmark designed to evaluate LLMs on Moroccan legal question answering (QA) tasks, characterised by rich linguistic and legal complexity. The dataset draws on Modern Standard Arabic, Islamic Maliki jurisprudence, Moroccan customary law, and French legal influences. Comprising over 1,700 multiple-choice questions, including multi-answer formats, MizanQA captures the nuances of authentic legal reasoning. Benchmarking experiments with multilingual and Arabic-focused LLMs reveal substantial performance gaps, highlighting the need for tailored evaluation metrics and culturally grounded, domain-specific LLM development.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on Moroccan legal QA tasks
Addressing low-resource Arabic legal domain limitations
Assessing cultural and legal complexity in LLM performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for Moroccan legal QA evaluation
Multilingual dataset with legal complexity
Tailored metrics for domain-specific LLMs
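Because the benchmark mixes single- and multi-answer multiple-choice questions, a natural baseline metric is exact set match: a prediction scores only if it selects exactly the gold option set. The sketch below illustrates that idea; the item schema and scoring rule are illustrative assumptions, not the paper's actual evaluation protocol.

```python
# Hypothetical exact-match scoring for a mixed single-/multi-answer MCQ
# benchmark such as MizanQA. The {"pred", "gold"} item format is an
# assumption made for illustration.

def score_item(predicted: set, gold: set) -> float:
    """Return 1.0 only if the predicted option set equals the gold set."""
    return 1.0 if predicted == gold else 0.0

def accuracy(items: list) -> float:
    """Mean exact-match accuracy over a list of {"pred", "gold"} items."""
    if not items:
        return 0.0
    return sum(score_item(set(i["pred"]), set(i["gold"])) for i in items) / len(items)

# Example: one correct single-answer item, one partially correct
# multi-answer item (partial credit is not awarded under exact match).
items = [
    {"pred": ["B"], "gold": ["B"]},
    {"pred": ["A", "C"], "gold": ["A", "C", "D"]},
]
print(accuracy(items))  # 0.5
```

Exact match is deliberately strict; under it, a model that recovers two of three correct options earns nothing, which is one reason the paper argues for tailored evaluation metrics in this domain.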