From RAG to Agentic: Validating Islamic-Medicine Responses with LLM Agents

📅 2025-06-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of integrating classical Islamic medical knowledge—such as that found in Avicenna's *Canon of Medicine* and *Prophetic Medicine*—with modern AI systems in a trustworthy, culturally grounded manner. Method: We propose Tibbe-AG, an evaluation framework targeting 30 preventive and holistic healthcare questions, employing three complementary paradigms: direct answering, retrieval-augmented generation (RAG), and self-critique. We introduce a culture-adapted 3C3H quality scoring system and a scientific self-critique filtering mechanism, coupled with LLM-agent adjudication. Our methodology integrates Islamic corpus–driven prompt engineering, multi-stage LLM agent chains (generation → self-evaluation → scoring), and a structured assessment protocol. Contribution/Results: Experiments show RAG improves factual accuracy by 13%, with self-critique yielding an additional 10% gain. Qwen2-7B achieves the best overall performance. This work establishes a novel paradigm for culturally sensitive, interpretable, and safe AI-assisted healthcare.
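The multi-stage chain described above (generation → self-evaluation → scoring) can be sketched as follows. This is a minimal illustration, not the paper's actual code: the function names, prompts, and the naive keyword retriever are all assumptions standing in for the real RAG component and model calls.

```python
# Hypothetical sketch of the Tibbe-AG three-stage agent chain.
# All names and prompts here are illustrative assumptions.

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. LLaMA-3, Mistral-7B, Qwen2-7B)."""
    return f"[model output for: {prompt[:40]}...]"

def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval standing in for the paper's RAG step."""
    terms = set(question.lower().split())
    ranked = sorted(corpus, key=lambda p: -len(terms & set(p.lower().split())))
    return ranked[:k]

def tibbe_ag_answer(question: str, corpus: list[str]) -> dict:
    # Stage 1: retrieval-augmented generation over the Islamic-medicine corpus.
    passages = retrieve(question, corpus)
    draft = call_llm(f"Context: {passages}\nQuestion: {question}\nAnswer:")
    # Stage 2: scientific self-critique filter applied to the draft answer.
    revised = call_llm(f"Critique and revise for safety and mechanism:\n{draft}")
    # Stage 3: a secondary LLM acts as agentic judge, emitting a 3C3H verdict.
    verdict = call_llm(f"Score this answer on the 3C3H rubric:\n{revised}")
    return {"draft": draft, "revised": revised, "verdict": verdict}
```

In a real pipeline, `call_llm` would dispatch to one of the three evaluated models, and the judge stage would use a separate model from the generator.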

📝 Abstract
Centuries-old Islamic medical texts like Avicenna's Canon of Medicine and the Prophetic Tibb-e-Nabawi encode a wealth of preventive care, nutrition, and holistic therapies, yet remain inaccessible to many and underutilized in modern AI systems. Existing language-model benchmarks focus narrowly on factual recall or user preference, leaving a gap in validating culturally grounded medical guidance at scale. We propose a unified evaluation pipeline, Tibbe-AG, that aligns 30 carefully curated Prophetic-medicine questions with human-verified remedies and compares three LLMs (LLaMA-3, Mistral-7B, Qwen2-7B) under three configurations: direct generation, retrieval-augmented generation, and a scientific self-critique filter. Each answer is then assessed by a secondary LLM serving as an agentic judge, yielding a single 3C3H quality score. Retrieval improves factual accuracy by 13%, while the agentic prompt adds another 10% improvement through deeper mechanistic insight and safety considerations. Our results demonstrate that blending classical Islamic texts with retrieval and self-evaluation enables reliable, culturally sensitive medical question-answering.
Problem

Research questions and friction points this paper is trying to address.

Validating culturally grounded Islamic medical guidance using LLMs
Assessing accuracy of retrieval-augmented generation in medical responses
Improving reliability of AI-driven Islamic medicine recommendations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified evaluation pipeline for Islamic medicine
Agentic judge with 3C3H quality scoring
Retrieval and self-critique improve accuracy
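The agentic judge condenses its per-dimension ratings into a single 3C3H quality score. The sketch below shows one plausible aggregation; the six dimension names and the equal-weight mean are assumptions, since the paper adapts the rubric for cultural grounding and does not appear here with its exact weighting.

```python
# Illustrative aggregation of an agentic judge's ratings into one 3C3H score.
# Dimension names and equal weighting are assumed, not taken from the paper.

DIMENSIONS = ("correct", "complete", "concise", "helpful", "honest", "harmless")

def score_3c3h(ratings: dict[str, float]) -> float:
    """Average the judge's 0-1 ratings across the six 3C3H dimensions."""
    missing = [d for d in DIMENSIONS if d not in ratings]
    if missing:
        raise ValueError(f"judge omitted dimensions: {missing}")
    return sum(ratings[d] for d in DIMENSIONS) / len(DIMENSIONS)
```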
Mohammad Amaan Sayeed
Mohamed bin Zayed University of Artificial Intelligence, UAE
Mohammed Talha Alam
Mohamed bin Zayed University of Artificial Intelligence, UAE
Raza Imam
Mohamed Bin Zayed University of Artificial Intelligence
Machine Learning, AI for Healthcare, Multimodal AI, Vision and Language
Shahab Saquib Sohail
Senior Assistant Professor, VIT Bhopal University
Computational Social Science, Computational Intelligence, Recommender System, AI and Society, LLM
Amir Hussain
Edinburgh Napier University, UK