MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models

📅 2024-09-29
🏛️ arXiv.org
📈 Citations: 11 (influential: 0)
🤖 AI Summary
LLMs can produce plausible-sounding but factually incorrect or fabricated answers to real-world patient health queries, yet most existing evaluations rely on standardized exam questions rather than authentic questions patients actually ask. Method: The paper introduces MedHalu, a first-of-its-kind medical hallucination dataset covering a diverse range of health topics, with LLM-generated responses annotated by hallucination type and hallucinated text span, and MedHaluDetect, a framework for evaluating how well various LLMs detect these hallucinations, compared against medical experts and laypeople. Contribution/Results: LLMs detect hallucinations much worse than medical experts and no better than laypeople, and in a few cases worse. An expert-in-the-loop approach that infuses expert reasoning into the detection process improves all evaluated LLMs, raising GPT-4's average macro-F1 by 6.3 percentage points. The work is a first systematic study of how poorly hallucinations are identified in real-world medical Q&A and of how to mitigate that gap.
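To make the described dataset structure concrete, here is a minimal sketch of what a MedHalu-style record might look like (a query, an LLM response, a hallucination-type label, and a hallucinated span). The class name, field names, and example values are illustrative assumptions, not the paper's released schema or taxonomy.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class MedHaluRecord:
    """Hypothetical shape of one MedHalu-style example: a patient query,
    an LLM-generated response, and span-level hallucination labels."""
    query: str                          # real-world health question from a patient
    response: str                       # LLM-generated answer
    hallucination_type: Optional[str]   # illustrative type label; None if the response is faithful
    span: Optional[Tuple[int, int]]     # character offsets of the hallucinated text

record = MedHaluRecord(
    query="Can I take ibuprofen together with my blood pressure medication?",
    response="Ibuprofen never interacts with blood pressure medication, so it is always safe.",
    hallucination_type="factually incorrect claim",   # illustrative, not the paper's label set
    span=(0, 56),
)

# Detection is then framed as recovering the label (and ideally the span)
# from the (query, response) pair alone.
print(record.response[record.span[0]:record.span[1]])
```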

📝 Abstract
The remarkable capabilities of large language models (LLMs) in language understanding and generation have not rendered them immune to hallucinations. LLMs can still generate plausible-sounding but factually incorrect or fabricated information. As LLM-empowered chatbots become popular, laypeople may frequently ask health-related queries and risk falling victim to these LLM hallucinations, resulting in various societal and healthcare implications. In this work, we conduct a pioneering study of hallucinations in LLM-generated responses to real-world healthcare queries from patients. We propose MedHalu, a carefully crafted, first-of-its-kind medical hallucination dataset with a diverse range of health-related topics and the corresponding hallucinated responses from LLMs, labeled with hallucination types and hallucinated text spans. We also introduce the MedHaluDetect framework to evaluate the capabilities of various LLMs in detecting hallucinations. We further employ three groups of evaluators -- medical experts, LLMs, and laypeople -- to study who is more vulnerable to these medical hallucinations. We find that LLMs are much worse than the experts at detecting hallucinations; they also perform no better than laypeople, and even worse in a few cases. To fill this gap, we propose an expert-in-the-loop approach to improve hallucination detection by LLMs through infusing expert reasoning. We observe significant performance gains for all the LLMs, with an average macro-F1 improvement of 6.3 percentage points for GPT-4.
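The abstract's expert-in-the-loop idea (infusing expert reasoning into the detection step) and its macro-F1 metric could be wired up roughly as in the sketch below. The prompt wording, the `detector` callable interface, and the example field names are assumptions for illustration only, not the authors' implementation; macro-F1 is computed here with scikit-learn.

```python
from typing import Callable, Iterable, Optional
from sklearn.metrics import f1_score

def build_prompt(query: str, response: str, expert_reasoning: Optional[str] = None) -> str:
    """Assemble a hallucination-detection prompt; optionally inject expert
    reasoning, which is the gist of the expert-in-the-loop idea."""
    prompt = (
        "Check the following health answer for hallucinations.\n"
        f"Patient query: {query}\n"
        f"LLM response: {response}\n"
    )
    if expert_reasoning:
        prompt += f"Expert reasoning to consider: {expert_reasoning}\n"
    prompt += "Does the response contain a hallucination? Answer yes or no."
    return prompt

def evaluate(detector: Callable[[str], str], examples: Iterable[dict]) -> float:
    """Score any detector (a callable mapping prompt text to a 'yes'/'no'
    string) with macro-F1, the metric reported in the paper."""
    examples = list(examples)
    y_true = [int(ex["hallucinated"]) for ex in examples]
    y_pred = [
        int(detector(build_prompt(ex["query"], ex["response"],
                                  ex.get("expert_reasoning"))).strip().lower().startswith("yes"))
        for ex in examples
    ]
    return f1_score(y_true, y_pred, average="macro")
```

Comparing `evaluate` with and without the `expert_reasoning` field populated mirrors the paper's reported gain from adding expert reasoning to the detection loop.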
Problem

Research questions and friction points this paper is trying to address.

Assessing LLM hallucinations in real-world healthcare queries
Detecting medical misinformation in LLM-generated responses
Improving hallucination detection via expert-in-the-loop integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MedHalu benchmark for medical hallucinations
Proposes MedHaluDetect framework for evaluation
Expert-in-the-loop improves hallucination detection