MedHal: An Evaluation Dataset for Medical Hallucination Detection

📅 2025-04-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current hallucination detection methods face significant limitations in the medical domain, where existing datasets are either too small or restricted to a single task such as Question Answering or Natural Language Inference, limiting rigorous assessment of model reliability. To address this, the authors introduce MedHal, a large-scale, multi-source, multi-task dataset for medical hallucination detection that draws on diverse medical text sources and tasks. MedHal provides a substantial volume of annotated samples suitable for training detection models, each pairing a factuality label with an explanation of the factual inconsistency to guide model learning. The authors demonstrate the dataset's utility by training and evaluating a baseline medical hallucination detection model, which improves over general-purpose hallucination detection approaches. By enabling more efficient evaluation of medical text generation systems, MedHal reduces reliance on costly expert review and can accelerate the development of medical AI research.

📝 Abstract
We present MedHal, a novel large-scale dataset specifically designed to evaluate if models can detect hallucinations in medical texts. Current hallucination detection methods face significant limitations when applied to specialized domains like medicine, where they can have disastrous consequences. Existing medical datasets are either too small, containing only a few hundred samples, or focus on a single task like Question Answering or Natural Language Inference. MedHal addresses these gaps by: (1) incorporating diverse medical text sources and tasks; (2) providing a substantial volume of annotated samples suitable for training medical hallucination detection models; and (3) including explanations for factual inconsistencies to guide model learning. We demonstrate MedHal's utility by training and evaluating a baseline medical hallucination detection model, showing improvements over general-purpose hallucination detection approaches. This resource enables more efficient evaluation of medical text generation systems while reducing reliance on costly expert review, potentially accelerating the development of medical AI research.
Problem

Research questions and friction points this paper is trying to address.

Accurately detecting hallucinations in medical texts
Overcoming the small size and single-task focus of existing medical datasets
Providing enough annotated samples to train detection models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diverse medical text sources and tasks
Large volume of annotated samples
Explanations for factual inconsistencies
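The innovations above imply a record layout in which each statement is judged against a source text, with an explanation attached when the statement is unfaithful. A minimal sketch of such a record and a scoring loop is below; the field names and `MedHalSample` type are hypothetical illustrations, not the released MedHal schema:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class MedHalSample:
    """One hypothetical MedHal-style record: a statement judged against
    a source context, with an explanation when it is not factual."""
    context: str            # source medical text (e.g. a note or abstract)
    statement: str          # claim to verify against the context
    is_factual: bool        # gold label: does the context support the claim?
    explanation: Optional[str] = None  # why the statement is inconsistent

def accuracy(samples: List[MedHalSample],
             predict: Callable[[str, str], bool]) -> float:
    """Score a detector `predict(context, statement) -> bool` on samples."""
    correct = sum(
        predict(s.context, s.statement) == s.is_factual for s in samples
    )
    return correct / len(samples)

# Toy demonstration: one hallucinated and one faithful statement.
samples = [
    MedHalSample(
        context="The patient was prescribed 500 mg amoxicillin.",
        statement="The patient was prescribed 250 mg penicillin.",
        is_factual=False,
        explanation="Drug and dose differ from the source note.",
    ),
    MedHalSample(
        context="MRI showed no acute intracranial abnormality.",
        statement="The MRI was unremarkable.",
        is_factual=True,
    ),
]

# A trivial baseline that accepts every statement gets half of them right.
always_true = lambda ctx, st: True
print(accuracy(samples, always_true))  # 0.5
```

A trained detector would replace `always_true` with a model call; the explanation field supports the paper's goal of teaching models why a statement is inconsistent, not just that it is.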