CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification

📅 2026-05-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses faithfulness hallucinations in hospital discharge summaries generated by large language models (LLMs): statements that contradict the source electronic health records (EHRs) and therefore put patient safety at risk. The authors propose CuraView, a multi-agent framework that builds a GraphRAG-based knowledge graph from each patient's EHR, retrieves evidence for every summary sentence, and assigns one of four evidence grades (E1–E4, from strong support to direct contradiction), producing interpretable evidence chains. On a 50-patient held-out test split drawn from a 250-patient subset of the Discharge-Me benchmark, the fine-tuned Qwen3-14B detection model reaches an F1 of 0.831 on E4 hallucinations (90.9% recall, 76.5% precision) and 0.823 on E3+E4, a 50.0% relative improvement over the base model, and it outperforms RAGTruth-style and QAGS-style baselines. The pipeline also yields reusable annotated data for downstream model training and distillation.
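The four-tier grading can be pictured as a small data structure. The following is only an illustrative sketch, not the authors' implementation: the names `EvidenceGrade`, `SentenceVerdict`, and `flagged` are hypothetical, the example sentences are made up, and the meanings of E2 and E3 are left open because this page defines only the two endpoints of the scale.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class EvidenceGrade(IntEnum):
    """Four evidence grades; only E1 and E4 are described on this page."""
    E1 = 1  # strong support from the EHR
    E2 = 2  # intermediate grade (not defined here)
    E3 = 3  # intermediate grade (not defined here)
    E4 = 4  # direct contradiction of the EHR


@dataclass
class SentenceVerdict:
    """Hypothetical record for one graded summary sentence."""
    sentence: str
    grade: EvidenceGrade
    evidence: list = field(default_factory=list)  # assumed: supporting or contradicting EHR snippets


def flagged(verdicts, threshold=EvidenceGrade.E4):
    """Return the sentences graded at or above the threshold.

    threshold=E4 corresponds to the safety-critical E4 metric,
    threshold=E3 to the broader E3+E4 metric.
    """
    return [v for v in verdicts if v.grade >= threshold]


# Made-up sentences, for illustration only.
verdicts = [
    SentenceVerdict("Patient was started on metoprolol.", EvidenceGrade.E1),
    SentenceVerdict("Follow-up was arranged in two weeks.", EvidenceGrade.E3),
    SentenceVerdict("Patient has no known drug allergies.", EvidenceGrade.E4),
]

e4_only = flagged(verdicts)
e3_and_e4 = flagged(verdicts, threshold=EvidenceGrade.E3)
```

Using an ordered enum keeps the two reported metrics (E4 alone, and E3+E4) expressible as a single threshold comparison.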

📝 Abstract

Discharge summaries require extracting critical information from lengthy electronic health records (EHRs), a process that is labor-intensive when performed manually. Large language models (LLMs) can improve generation efficiency; however, they are prone to producing faithfulness hallucinations, statements that contradict source records, posing direct risks to patient safety. To address this, we present CuraView, a multi-agent framework for sentence-level detection and evidence-grounded explanation of faithfulness hallucinations in discharge summaries. CuraView constructs a GraphRAG-based knowledge graph from patient-level EHRs and implements a closed-loop generation-detection pipeline with sentence-level evidence retrieval and classification spanning four evidence grades from strong support to direct contradiction (E1-E4), yielding structured and interpretable evidence chains. We evaluate CuraView on a subset of 250 patients from the Discharge-Me benchmark, with 50 patients held out for testing. Our fine-tuned Qwen3-14B detection model achieves an F1 of 0.831 on the safety-critical E4 metric (90.9% recall, 76.5% precision) and an F1 of 0.823 on E3+E4, representing a 50.0% relative improvement over the base model and outperforming RAGTruth-style and QAGS-style baselines. These results demonstrate that evidence-chain-based graph retrieval verification substantially improves the factual reliability of clinical documentation, while simultaneously producing reusable annotated datasets for downstream model training and distillation.
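As a quick consistency check on the reported E4 numbers, F1 is the harmonic mean of precision and recall. The short sketch below uses the standard formula; the function name `f1_score` is chosen here for illustration and does not come from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported E4 precision and recall from the abstract.
e4_f1 = f1_score(precision=0.765, recall=0.909)
print(f"E4 F1 = {e4_f1:.3f}")  # prints "E4 F1 = 0.831"
```

The result matches the reported F1 of 0.831, so the three E4 figures are mutually consistent. The recall-heavy balance (90.9% versus 76.5%) means the detector misses few contradictions at the cost of more false alarms.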

Problem

Research questions and friction points this paper is trying to address.

medical hallucination

faithfulness

discharge summaries

electronic health records

patient safety

Innovation

Methods, ideas, or system contributions that make the work stand out.

GraphRAG

multi-agent framework

faithfulness hallucination detection