SAHM: A Benchmark for Arabic Financial and Shari'ah-Compliant Reasoning

📅 2026-04-21

🤖 AI Summary
This study addresses the lack of a trustworthy reasoning benchmark for Arabic financial NLP and Islamic finance, a gap that hinders the development of compliant intelligent assistants. The authors introduce SAHM, a multitask, document-grounded Arabic benchmark that combines Shari'ah principles with financial domain knowledge across seven tasks. It is built from authentic regulatory, juristic, and corporate documents, verified by domain experts, and accompanied by an instruction-tuning dataset and a rubric-based evaluation framework for open-ended outputs. An evaluation of 19 open and proprietary large language models shows that they are substantially stronger on recognition-style tasks than on generation and causal reasoning, with the largest gaps on event-cause reasoning. The benchmark, evaluation framework, and an instruction-tuned model are released to support research on trustworthy Arabic financial NLP.

📝 Abstract
English financial NLP has progressed rapidly through benchmarks for sentiment, document understanding, and financial question answering, while Arabic financial NLP remains comparatively under-explored despite strong practical demand for trustworthy finance and Islamic-finance assistants. We introduce SAHM, a document-grounded benchmark and instruction-tuning dataset for Arabic financial NLP and Shari'ah-compliant reasoning. SAHM contains 14,380 expert-verified instances spanning seven tasks: AAOIFI standards QA, fatwa-based QA/MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event-cause reasoning, curated from authentic regulatory, juristic, and corporate sources. We evaluate 19 strong open and proprietary LLMs using task-specific metrics and rubric-based scoring for open-ended outputs, and find that Arabic fluency does not reliably translate to evidence-grounded financial reasoning: models are substantially stronger on recognition-style tasks than on generation and causal reasoning, with the largest gaps on event-cause reasoning. We release the benchmark, evaluation framework, and an instruction-tuned model to support future research on trustworthy Arabic financial NLP.
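
The abstract mentions two kinds of scoring: task-specific metrics for recognition-style tasks and rubric-based scoring for open-ended outputs. The sketch below illustrates one plausible shape for each, assuming exact-match accuracy for MCQ and a weighted mean over rubric criteria. The names `RubricItem`, `mcq_accuracy`, and `rubric_score`, the criteria, and the weights are all hypothetical and are not taken from the SAHM evaluation framework.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    """One hypothetical rubric criterion: a weight and an awarded score in [0, 1]."""
    name: str
    weight: float
    score: float


def mcq_accuracy(predictions: list[str], gold: list[str]) -> float:
    """Exact-match accuracy over option labels, ignoring case and whitespace."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(
        p.strip().upper() == g.strip().upper() for p, g in zip(predictions, gold)
    )
    return correct / len(gold)


def rubric_score(items: list[RubricItem]) -> float:
    """Weighted mean of per-criterion scores for a single open-ended answer."""
    total_weight = sum(item.weight for item in items)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    return sum(item.weight * item.score for item in items) / total_weight


if __name__ == "__main__":
    # Illustrative inputs only; not data from the benchmark.
    print(mcq_accuracy(["a", "B", "c"], ["A", "B", "D"]))
    print(
        rubric_score(
            [
                RubricItem("evidence grounding", weight=2.0, score=1.0),
                RubricItem("correctness", weight=1.0, score=0.5),
                RubricItem("fluency", weight=1.0, score=0.5),
            ]
        )
    )
```

Keeping the two scores separate makes it possible to see the kind of gap the abstract reports, where a model can do well on recognition-style items while scoring lower on open-ended generation.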
Problem

Research questions and friction points this paper is trying to address.

Arabic financial NLP
Shari'ah-compliant reasoning
benchmark
financial question answering
document-grounded reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Arabic financial NLP
Shari'ah-compliant reasoning
document-grounded benchmark
instruction-tuning dataset
evidence-based reasoning