MedHal: An Evaluation Dataset for Medical Hallucination Detection

📅 2025-04-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current hallucination detection methods face significant limitations in the medical domain, where existing datasets are either too small or restricted to a single task such as Question Answering or Natural Language Inference, limiting rigorous assessment of model reliability. To address this, the authors introduce MedHal, a large-scale, multi-source, multi-task dataset for medical hallucination detection that draws on diverse medical text sources and tasks. MedHal provides a substantial volume of annotated samples suitable for training detection models, each pairing a factuality label with an explanation of the factual inconsistency to guide model learning. The authors demonstrate the dataset's utility by training and evaluating a baseline medical hallucination detection model, which improves over general-purpose hallucination detection approaches. By enabling more efficient evaluation of medical text generation systems, MedHal reduces reliance on costly expert review and can accelerate the development of medical AI research.

📝 Abstract
We present MedHal, a novel large-scale dataset specifically designed to evaluate if models can detect hallucinations in medical texts. Current hallucination detection methods face significant limitations when applied to specialized domains like medicine, where they can have disastrous consequences. Existing medical datasets are either too small, containing only a few hundred samples, or focus on a single task like Question Answering or Natural Language Inference. MedHal addresses these gaps by: (1) incorporating diverse medical text sources and tasks; (2) providing a substantial volume of annotated samples suitable for training medical hallucination detection models; and (3) including explanations for factual inconsistencies to guide model learning. We demonstrate MedHal's utility by training and evaluating a baseline medical hallucination detection model, showing improvements over general-purpose hallucination detection approaches. This resource enables more efficient evaluation of medical text generation systems while reducing reliance on costly expert review, potentially accelerating the development of medical AI research.
Problem

Research questions and friction points this paper is trying to address.

Accurately detecting hallucinations in medical texts
Overcoming the small size and single-task focus of existing medical datasets
Providing enough annotated samples to train detection models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diverse medical text sources and tasks
Large volume of annotated samples
Explanations for factual inconsistencies
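The innovations above imply a record layout in which each statement is judged against a source text, with an explanation attached when the statement is unfaithful. A minimal sketch of such a record and a scoring loop is below; the field names and `MedHalSample` type are hypothetical illustrations, not the released MedHal schema:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class MedHalSample:
    """One hypothetical MedHal-style record: a statement judged against
    a source context, with an explanation when it is not factual."""
    context: str            # source medical text (e.g. a note or abstract)
    statement: str          # claim to verify against the context
    is_factual: bool        # gold label: does the context support the claim?
    explanation: Optional[str] = None  # why the statement is inconsistent

def accuracy(samples: List[MedHalSample],
             predict: Callable[[str, str], bool]) -> float:
    """Score a detector `predict(context, statement) -> bool` on samples."""
    correct = sum(
        predict(s.context, s.statement) == s.is_factual for s in samples
    )
    return correct / len(samples)

# Toy demonstration: one hallucinated and one faithful statement.
samples = [
    MedHalSample(
        context="The patient was prescribed 500 mg amoxicillin.",
        statement="The patient was prescribed 250 mg penicillin.",
        is_factual=False,
        explanation="Drug and dose differ from the source note.",
    ),
    MedHalSample(
        context="MRI showed no acute intracranial abnormality.",
        statement="The MRI was unremarkable.",
        is_factual=True,
    ),
]

# A trivial baseline that accepts every statement gets half of them right.
always_true = lambda ctx, st: True
print(accuracy(samples, always_true))  # 0.5
```

A trained detector would replace `always_true` with a model call; the explanation field supports the paper's goal of teaching models why a statement is inconsistent, not just that it is.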