Fact-Controlled Diagnosis of Hallucinations in Medical Text Summarization

📅 2025-05-31
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Hallucinations in large language models (LLMs) pose serious risks to clinical decision-making in medical dialogue summarization, yet existing research is scarce and general-purpose hallucination detectors perform poorly in clinical settings. Method: We introduce the first dual-track expert-annotated dataset, featuring both controlled factual omissions and naturally occurring hallucinations, and propose a fact-controlled hallucination generation paradigm. We design an interpretable fact-counting detection framework that integrates factual extraction and comparison, fine-tuned and prompt-engineered LLM-based identification, and Leave-N-out construction of controllable data. Contribution/Results: Our method achieves a 27.3% improvement in F1-score for natural hallucination detection. We release the first evaluation suite specific to clinical hallucinations, comprising two benchmark datasets and multi-dimensional metrics, validated and adopted by domain experts.
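To make the prompt-engineered identification track concrete, here is a minimal sketch of what an LLM-based detector can look like. The prompt wording, the DETECTION_PROMPT name, and the call_llm helper are hypothetical illustrations, not the paper's published prompts or API.

```python
from typing import Callable

# Hypothetical prompt -- the paper does not publish its exact prompts.
DETECTION_PROMPT = """You are auditing a clinical summary.
Dialogue:
{dialogue}

Summary:
{summary}

List every statement in the summary that is NOT supported by the
dialogue, one per line. If all statements are supported, reply NONE."""

def detect_hallucinations(
    dialogue: str, summary: str, call_llm: Callable[[str], str]
) -> list[str]:
    """Return the unsupported summary statements flagged by the LLM.
    call_llm is a stand-in for any chat-completion API."""
    reply = call_llm(DETECTION_PROMPT.format(dialogue=dialogue, summary=summary))
    lines = [ln.strip() for ln in reply.splitlines() if ln.strip()]
    return [] if lines == ["NONE"] else lines

# Toy run with a mock LLM that flags one statement.
flags = detect_hallucinations(
    "Patient reports a headache.",
    "Patient has a headache and fever.",
    call_llm=lambda p: "and fever",
)
print(flags)  # ['and fever']
```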

📝 Abstract
Hallucinations in large language models (LLMs) during summarization of patient-clinician dialogues pose significant risks to patient care and clinical decision-making. However, the phenomenon remains understudied in the clinical domain, with uncertainty surrounding the applicability of general-domain hallucination detectors. The rarity and randomness of hallucinations further complicate their investigation. In this paper, we conduct an evaluation of hallucination detection methods in the medical domain and construct two datasets for this purpose: a fact-controlled Leave-N-out dataset, generated by systematically removing facts from source dialogues to induce hallucinated content in summaries, and a natural hallucination dataset, arising organically during LLM-based medical summarization. We show that general-domain detectors struggle to detect clinical hallucinations, and that performance on fact-controlled hallucinations does not reliably predict effectiveness on natural hallucinations. We then develop fact-based approaches that count hallucinations, offering explainability not available with existing methods. Notably, our LLM-based detectors, developed using fact-controlled hallucinations, generalize well to detecting real-world clinical hallucinations. This research contributes a suite of specialized metrics supported by expert-annotated datasets to advance faithful clinical summarization systems.
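The Leave-N-out construction described in the abstract can be pictured with a short sketch. This is a minimal illustration under simplifying assumptions, not the paper's implementation: extract_facts naively treats each sentence as one fact, and summarize is any pluggable LLM summarizer, whereas the paper relies on expert annotation and controlled fact removal.

```python
import random
from typing import Callable, Dict, List

def extract_facts(dialogue: str) -> List[str]:
    """Naive fact splitter: one sentence = one fact.
    A real pipeline would use an LLM or a clinical IE model."""
    return [s.strip() for s in dialogue.split(".") if s.strip()]

def leave_n_out(
    dialogue: str,
    n: int,
    summarize: Callable[[str], str],
    seed: int = 0,
) -> Dict[str, object]:
    """Build one fact-controlled example: drop N facts from the source,
    summarize the reduced dialogue, and keep the removed facts as ground
    truth. Any summary content supported only by the removed facts is,
    by construction, a hallucination."""
    facts = extract_facts(dialogue)
    rng = random.Random(seed)
    removed = set(rng.sample(range(len(facts)), k=min(n, len(facts))))
    kept = [f for i, f in enumerate(facts) if i not in removed]
    reduced = ". ".join(kept) + "."
    return {
        "source": reduced,
        "summary": summarize(reduced),  # plug in any LLM summarizer here
        "removed_facts": [facts[i] for i in sorted(removed)],
    }

# Toy usage with a trivial "summarizer" that echoes the first kept fact.
example = leave_n_out(
    "Patient reports chest pain. Pain started two days ago. "
    "Patient takes aspirin daily. No history of diabetes.",
    n=2,
    summarize=lambda text: extract_facts(text)[0] + ".",
)
print(example["removed_facts"])
```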
Problem

Research questions and friction points this paper is trying to address.

Detecting hallucinations in medical text summarization by LLMs
Evaluating general-domain hallucination detectors in clinical settings
Developing explainable fact-based methods for clinical hallucination detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fact-controlled Leave-N-out dataset generation
Fact-based hallucination counting methods (see the sketch after this list)
LLM-based detectors for clinical hallucinations
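A minimal sketch of the fact-counting idea follows. The supported check here is a crude word-overlap placeholder standing in for the paper's LLM-based factual extraction and comparison; all names and thresholds are illustrative assumptions.

```python
from typing import List

def extract_facts(text: str) -> List[str]:
    """Naive fact splitter (one sentence = one fact); a real system
    would extract atomic clinical facts with an LLM."""
    return [s.strip() for s in text.split(".") if s.strip()]

def supported(fact: str, source: str, min_overlap: float = 0.6) -> bool:
    """Crude support check: fraction of the fact's content words that
    also appear in the source dialogue."""
    words = {w.lower() for w in fact.split() if len(w) > 3}
    if not words:
        return True
    source_words = {w.lower() for w in source.split()}
    return len(words & source_words) / len(words) >= min_overlap

def count_hallucinations(summary: str, source: str) -> int:
    """Explainable detector output: the number of summary facts that
    are not supported by the source dialogue."""
    return sum(1 for f in extract_facts(summary) if not supported(f, source))

source = "Patient reports chest pain. Pain started two days ago."
summary = "Patient has chest pain. Patient was prescribed nitroglycerin."
print(count_hallucinations(summary, source))  # 1: the medication fact
```

Counting, rather than binary flagging, is what makes the approach interpretable: each unsupported fact can be surfaced to a reviewer as evidence.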
👥 Authors
BN Suhas
Amazon, USA
Han-Chin Shing
Amazon Web Services
natural language processing, mental health, clinical NLP
Lei Xu
Amazon, USA
Mitch Strong
Amazon, USA
Jon Burnsky
Amazon, USA
Jessica Ofor
Amazon, USA
Jordan R. Mason
Amazon, USA
Susan Chen
Amazon, USA
Sundararajan Srinivasan
Amazon, USA
Chaitanya P. Shivade
Amazon, USA
Jack Moriarty
Amazon, USA
Joseph Paul Cohen
Amazon, AIMI (Stanford University), Mila (Quebec AI Institute)
Medical Imaging, Explainable AI, Genomics, Computer Vision, Representation Learning