From RAG to Agentic RAG for Faithful Islamic Question Answering

📅 2026-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the tendency of large language models to hallucinate unsupported answers in Islamic question answering, and their lack of any mechanism for abstaining when evidence is absent. To tackle these issues, the authors introduce ISLAMICFAITHQA, the first atomic-level bilingual benchmark for trustworthy religious QA, and propose an agentic retrieval-augmented generation (RAG) framework that uses structured tool calling for iterative retrieval and answer refinement. They align models via Arabic-grounded supervised fine-tuning (SFT) reasoning pairs and bilingual preference data for reward-guided alignment, and construct a verse-level retrieval corpus covering roughly 6,000 Quranic verses. Experiments show that the proposed approach significantly improves faithfulness and cross-lingual robustness, outperforming standard RAG on both Arabic-centric and multilingual models, with even compact models such as Qwen3-4B achieving state-of-the-art performance.

📝 Abstract
LLMs are increasingly used for Islamic question answering, where ungrounded responses may carry serious religious consequences. Yet standard MCQ/MRC-style evaluations do not capture key real-world failure modes, notably free-form hallucinations and whether models appropriately abstain when evidence is lacking. To shed light on this, we introduce ISLAMICFAITHQA, a 3,810-item bilingual (Arabic/English) generative benchmark with atomic single-gold answers, which enables direct measurement of hallucination and abstention. We additionally develop an end-to-end grounded Islamic modelling suite consisting of (i) 25K Arabic text-grounded SFT reasoning pairs, (ii) 5K bilingual preference samples for reward-guided alignment, and (iii) a verse-level Qur'an retrieval corpus of $\sim$6k atomic verses (ayat). Building on these resources, we develop an agentic Quran-grounding framework (agentic RAG) that uses structured tool calls for iterative evidence seeking and answer revision. Experiments across Arabic-centric and multilingual LLMs show that retrieval improves correctness and that agentic RAG yields the largest gains beyond standard RAG, achieving state-of-the-art performance and stronger Arabic-English robustness even with a small model (i.e., Qwen3 4B). We will make the experimental resources and datasets publicly available for the community.
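The agentic loop described in the abstract (iterative evidence seeking via structured tool calls, answering only when grounded, abstaining otherwise) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the two-verse toy corpus, the keyword-overlap retriever, the stopword list, and the query-revision step are all stand-ins for the paper's $\sim$6k-ayah corpus and LLM-driven tool calling.

```python
import re

# Toy verse-level corpus (stand-in for the ~6k atomic ayat in the paper).
CORPUS = {
    "Q112:1": "Say: He is Allah, the One.",
    "Q2:255": "Allah - there is no deity except Him, the Ever-Living.",
}

# Tiny illustrative stopword list so function words don't count as evidence.
STOP = {"is", "the", "a", "of", "what", "say", "he", "there", "no", "and", "to"}

def tokens(text: str) -> set:
    """Lowercased content-word tokens of a string."""
    return set(re.findall(r"\w+", text.lower())) - STOP

def retrieve(query: str, top_k: int = 1):
    """Retrieval tool: rank verses by keyword overlap; drop zero-score hits."""
    q = tokens(query)
    scored = [(len(q & tokens(text)), vid, text) for vid, text in CORPUS.items()]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [(vid, text) for score, vid, text in scored[:top_k] if score > 0]

def agentic_answer(question: str, max_rounds: int = 3):
    """Agent loop: call the retrieval tool, revise the query if it fails,
    answer with cited evidence on success, and abstain after max_rounds."""
    query = question
    for _ in range(max_rounds):
        hits = retrieve(query)
        if hits:
            verse_id, text = hits[0]
            return {"answer": text, "evidence": verse_id}
        # Naive query revision; a real agent would let the LLM rewrite it.
        query = " ".join(w for w in query.split() if len(w) > 3)
    return {"answer": None, "evidence": None}  # abstain: no supporting verse
```

With these stand-ins, a question whose content words overlap a verse returns that verse as evidence, while an off-corpus question (e.g. about geography) exhausts its retrieval rounds and yields an explicit abstention, mirroring the hallucination-vs-abstention behaviour the benchmark measures.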
Problem

Research questions and friction points this paper is trying to address.

Islamic question answering
hallucination
abstention
faithful AI
religious consequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic RAG
Islamic Question Answering
Hallucination Measurement
Verse-level Retrieval
Bilingual Grounded Benchmark