🤖 AI Summary
This study addresses the propensity of large language models to generate unsupported hallucinations in Islamic question answering and their lack of a mechanism to abstain when evidence is absent. To tackle these issues, the authors introduce ISLAMICFAITHQA, the first atomic-level bilingual benchmark for trustworthy religious QA, and propose an agent-based retrieval-augmented generation (RAG) framework that leverages structured tool calling for iterative retrieval and answer refinement. They align model rewards using Arabic-grounded supervised fine-tuning (SFT) reasoning pairs and bilingual preference data, and construct a verse-level retrieval corpus covering approximately 6,000 Quranic verses. Experiments demonstrate that the proposed approach significantly enhances model faithfulness and cross-lingual robustness, outperforming standard RAG on both Arabic and multilingual models, with even compact architectures such as Qwen3-4B achieving state-of-the-art performance.
📝 Abstract
LLMs are increasingly used for Islamic question answering, where ungrounded responses may carry serious religious consequences. Yet standard MCQ/MRC-style evaluations do not capture key real-world failure modes, notably free-form hallucinations and whether models appropriately abstain when evidence is lacking. To shed light on this, we introduce ISLAMICFAITHQA, a 3,810-item bilingual (Arabic/English) generative benchmark with atomic single-gold answers, which enables direct measurement of hallucination and abstention. We additionally develop an end-to-end grounded Islamic modelling suite consisting of (i) 25K Arabic text-grounded SFT reasoning pairs, (ii) 5K bilingual preference samples for reward-guided alignment, and (iii) a verse-level Qur'an retrieval corpus of ~6,000 atomic verses (ayat). Building on these resources, we develop an agentic Qur'an-grounding framework (agentic RAG) that uses structured tool calls for iterative evidence seeking and answer revision. Experiments across Arabic-centric and multilingual LLMs show that retrieval improves correctness and that agentic RAG yields the largest gains beyond standard RAG, achieving state-of-the-art performance and stronger Arabic-English robustness even with a small model (i.e., Qwen3-4B). We will make the experimental resources and datasets publicly available for the community.
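The abstract's agentic RAG loop can be pictured roughly as follows: the model issues structured tool calls to a verse-level retriever, revises its query when evidence is missing, and abstains if no supporting verse is found. This is a minimal illustrative sketch, not the authors' implementation; all names (`search_verses`, `agentic_answer`) and the two-verse toy corpus are assumptions standing in for the real ~6k-ayah index and dense retriever.

```python
# Hypothetical sketch of an agentic RAG loop with structured tool calls
# for verse-grounded QA. Names and corpus are illustrative only.

# Toy verse-level corpus standing in for the ~6k-ayah retrieval index.
CORPUS = {
    "112:1": "Say: He is Allah, the One.",
    "2:255": "Allah - there is no deity except Him, the Ever-Living...",
}

def search_verses(query: str) -> list[tuple[str, str]]:
    """Keyword match over the toy corpus (a real system would use dense retrieval)."""
    terms = query.lower().split()
    return [(ref, text) for ref, text in CORPUS.items()
            if any(t in text.lower() for t in terms)]

def agentic_answer(question: str, max_rounds: int = 3) -> str:
    """Iteratively seek evidence via tool calls; abstain if none is found."""
    query = question
    for _ in range(max_rounds):
        hits = search_verses(query)  # structured tool call to the retriever
        if hits:
            ref, text = hits[0]
            return f"Grounded in {ref}: {text}"
        # Revise the query (here: drop short stopword-like tokens) and retry.
        query = " ".join(w for w in query.split() if len(w) > 3)
    return "No supporting evidence found; abstaining."
```

The key behavioural contrast with standard RAG is the explicit abstention branch: when repeated retrieval rounds return no evidence, the agent refuses rather than generating an ungrounded answer.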