Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses reliability issues—including textual misquotation, jurisprudential inaccuracies, and cultural incoherence—in large language models (LLMs) deployed for Islamic guidance. To this end, we propose the first Muslim-perspective-driven, dual-agent automated evaluation framework. Our method integrates six-dimensional quantitative scoring (e.g., Islamic accuracy: 3.93/5; citation quality: 3.38/5) with five-dimensional qualitative pairwise comparative analysis, enabling dual validation of citation fidelity and semantic consistency. Evaluating GPT-4o, Ansari AI, and Fanar, we find GPT-4o achieves the highest quantitative scores, while Ansari AI wins 116 of 200 qualitative pairwise comparisons; Fanar, though comparatively weaker overall, demonstrates innovative cultural adaptation. Our contributions include a community-co-developed, religion-sensitive evaluation benchmark and a reproducible dual-agent assessment paradigm, advancing trustworthy LLM evaluation in high-stakes domains.

📝 Abstract
Large language models are increasingly used for Islamic guidance, but risk misquoting texts, misapplying jurisprudence, or producing culturally inconsistent responses. We pilot an evaluation of GPT-4o, Ansari AI, and Fanar on prompts from authentic Islamic blogs. Our dual-agent framework uses a quantitative agent for citation verification and six-dimensional scoring (e.g., Structure, Islamic Consistency, Citations) and a qualitative agent for five-dimensional side-by-side comparison (e.g., Tone, Depth, Originality). GPT-4o scored highest in Islamic Accuracy (3.93) and Citation (3.38), Ansari AI followed (3.68, 3.32), and Fanar lagged (2.76, 1.82). Despite relatively strong performance, models still fall short in reliably producing accurate Islamic content and citations, a paramount requirement in faith-sensitive writing. GPT-4o had the highest mean quantitative score (3.90/5), while Ansari AI led qualitative pairwise wins (116/200). Fanar, though trailing, introduces innovations for Islamic and Arabic contexts. This study underscores the need for community-driven benchmarks centering Muslim perspectives, offering an early step toward more reliable AI in Islamic knowledge and other high-stakes domains such as medicine, law, and journalism.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM accuracy in generating Islamic religious content
Assessing citation reliability and jurisprudential correctness in AI responses
Addressing cultural consistency in faith-sensitive AI-generated writings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-agent framework for automated evaluation
Quantitative agent verifies citations and assigns six-dimensional scores
Qualitative agent performs pairwise comparisons on tone, depth, and originality
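The dual-agent loop described above can be sketched as follows. This is a hypothetical illustration, not the authors' code: the dimension names, the stand-in judge functions, and the tallying logic are assumptions based on the summary (six quantitative dimensions scored 1-5, plus pairwise qualitative wins counted across model pairs). In the real framework, both agents would be LLM judges.

```python
# Hedged sketch of a dual-agent evaluation pipeline (dimension names assumed).
from statistics import mean

QUANT_DIMENSIONS = ["structure", "islamic_consistency", "citations",
                    "clarity", "relevance", "depth"]  # illustrative six dimensions

def quantitative_agent(response: str) -> dict:
    """Stand-in for an LLM judge that verifies citations and scores 1-5 per dimension."""
    # Placeholder: a real agent would inspect the response; here every
    # dimension gets a fixed score purely for illustration.
    return {dim: 4 for dim in QUANT_DIMENSIONS}

def qualitative_agent(resp_a: str, resp_b: str) -> str:
    """Stand-in pairwise judge: returns 'A' or 'B' for the preferred response."""
    # Placeholder preference rule (longer answer wins); a real agent would
    # compare tone, depth, and originality side by side.
    return "A" if len(resp_a) >= len(resp_b) else "B"

def evaluate(models: dict, prompts: list) -> dict:
    """Run both agents over all prompts; collect mean scores and pairwise wins."""
    results = {name: {"scores": [], "wins": 0} for name in models}
    for prompt in prompts:
        answers = {name: generate(prompt) for name, generate in models.items()}
        # Quantitative pass: mean of the six dimension scores per response.
        for name, ans in answers.items():
            results[name]["scores"].append(mean(quantitative_agent(ans).values()))
        # Qualitative pass: every unordered pair of models is compared once.
        names = list(models)
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                winner = qualitative_agent(answers[names[i]], answers[names[j]])
                results[names[i] if winner == "A" else names[j]]["wins"] += 1
    return results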