From RAG to Agentic: Validating Islamic-Medicine Responses with LLM Agents

📅 2025-06-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of integrating classical Islamic medical knowledge—such as that found in Avicenna's *Canon of Medicine* and *Prophetic Medicine*—with modern AI systems in a trustworthy, culturally grounded manner. Method: We propose Tibbe-AG, an evaluation framework targeting 30 preventive and holistic healthcare questions, employing three complementary paradigms: direct answering, retrieval-augmented generation (RAG), and self-critique. We introduce a culture-adapted 3C3H quality scoring system and a scientific self-critique filtering mechanism, coupled with LLM-agent adjudication. Our methodology integrates Islamic corpus–driven prompt engineering, multi-stage LLM agent chains (generation → self-evaluation → scoring), and a structured assessment protocol. Contribution/Results: Experiments show RAG improves factual accuracy by 13%, with self-critique yielding an additional 10% gain. Qwen2-7B achieves the best overall performance. This work establishes a novel paradigm for culturally sensitive, interpretable, and safe AI-assisted healthcare.
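The multi-stage chain described above (generation → self-evaluation → scoring) can be sketched as follows. This is a minimal illustration, not the paper's actual code: the function names, prompts, and the naive keyword retriever are all assumptions standing in for the real RAG component and model calls.

```python
# Hypothetical sketch of the Tibbe-AG three-stage agent chain.
# All names and prompts here are illustrative assumptions.

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. LLaMA-3, Mistral-7B, Qwen2-7B)."""
    return f"[model output for: {prompt[:40]}...]"

def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval standing in for the paper's RAG step."""
    terms = set(question.lower().split())
    ranked = sorted(corpus, key=lambda p: -len(terms & set(p.lower().split())))
    return ranked[:k]

def tibbe_ag_answer(question: str, corpus: list[str]) -> dict:
    # Stage 1: retrieval-augmented generation over the Islamic-medicine corpus.
    passages = retrieve(question, corpus)
    draft = call_llm(f"Context: {passages}\nQuestion: {question}\nAnswer:")
    # Stage 2: scientific self-critique filter applied to the draft answer.
    revised = call_llm(f"Critique and revise for safety and mechanism:\n{draft}")
    # Stage 3: a secondary LLM acts as agentic judge, emitting a 3C3H verdict.
    verdict = call_llm(f"Score this answer on the 3C3H rubric:\n{revised}")
    return {"draft": draft, "revised": revised, "verdict": verdict}
```

In a real pipeline, `call_llm` would dispatch to one of the three evaluated models, and the judge stage would use a separate model from the generator.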

📝 Abstract
Centuries-old Islamic medical texts like Avicenna's Canon of Medicine and the Prophetic Tibb-e-Nabawi encode a wealth of preventive care, nutrition, and holistic therapies, yet remain inaccessible to many and underutilized in modern AI systems. Existing language-model benchmarks focus narrowly on factual recall or user preference, leaving a gap in validating culturally grounded medical guidance at scale. We propose a unified evaluation pipeline, Tibbe-AG, that aligns 30 carefully curated Prophetic-medicine questions with human-verified remedies and compares three LLMs (LLaMA-3, Mistral-7B, Qwen2-7B) under three configurations: direct generation, retrieval-augmented generation, and a scientific self-critique filter. Each answer is then assessed by a secondary LLM serving as an agentic judge, yielding a single 3C3H quality score. Retrieval improves factual accuracy by 13%, while the agentic prompt adds another 10% improvement through deeper mechanistic insight and safety considerations. Our results demonstrate that blending classical Islamic texts with retrieval and self-evaluation enables reliable, culturally sensitive medical question-answering.
Problem

Research questions and friction points this paper is trying to address.

Validating culturally grounded Islamic medical guidance using LLMs
Assessing accuracy of retrieval-augmented generation in medical responses
Improving reliability of AI-driven Islamic medicine recommendations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified evaluation pipeline for Islamic medicine
Agentic judge with 3C3H quality scoring
Retrieval and self-critique improve accuracy
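The agentic judge condenses its per-dimension ratings into a single 3C3H quality score. The sketch below shows one plausible aggregation; the six dimension names and the equal-weight mean are assumptions, since the paper adapts the rubric for cultural grounding and does not appear here with its exact weighting.

```python
# Illustrative aggregation of an agentic judge's ratings into one 3C3H score.
# Dimension names and equal weighting are assumed, not taken from the paper.

DIMENSIONS = ("correct", "complete", "concise", "helpful", "honest", "harmless")

def score_3c3h(ratings: dict[str, float]) -> float:
    """Average the judge's 0-1 ratings across the six 3C3H dimensions."""
    missing = [d for d in DIMENSIONS if d not in ratings]
    if missing:
        raise ValueError(f"judge omitted dimensions: {missing}")
    return sum(ratings[d] for d in DIMENSIONS) / len(DIMENSIONS)
```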
Mohammad Amaan Sayeed
Mohamed bin Zayed University of Artificial Intelligence, UAE
Mohammed Talha Alam
Mohamed bin Zayed University of Artificial Intelligence, UAE
Raza Imam
Mohamed Bin Zayed University of Artificial Intelligence
Machine Learning, AI for Healthcare, Multimodal AI, Vision and Language
Shahab Saquib Sohail
Senior Assistant Professor, VIT Bhopal University
Computational Social Science, Computational Intelligence, Recommender System, AI and Society, LLM
Amir Hussain
Edinburgh Napier University, UK