Mitigating Hallucinations in Healthcare LLMs with Granular Fact-Checking and Domain-Specific Adaptation

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical large language models (LLMs) frequently generate hallucinated outputs, posing significant risks to clinical decision-making safety. Method: This paper proposes a decoupled fact-checking framework featuring an LLM-independent, fine-grained proposition-level verification module that jointly performs numerical consistency checking and discrete logical reasoning, augmented by a domain-specific summarization model fine-tuned on MIMIC-III. Contribution/Results: Departing from end-to-end paradigms, our approach enables EHR-driven, interpretable, and verifiable validation. Using LoRA-based fine-tuning and EHR-aligned modeling, the framework achieves an F1 score of 0.8556; the summarization model attains ROUGE-1 of 0.5797 and BERTScore of 0.9120. It delivers high-precision, trustworthy verification across 3,786 clinical propositions, substantially enhancing output reliability and clinical safety.
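The proposition-level numerical consistency check described in the summary could look roughly like the following sketch. This is a minimal illustration, not the paper's implementation; the function name, EHR field names, and tolerance parameter are all hypothetical:

```python
import re

def numbers_match_ehr(proposition: str, ehr_record: dict, tol: float = 0.0) -> bool:
    """Return True if every number stated in a generated proposition
    appears (within tolerance) among the patient's EHR values."""
    stated = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", proposition)]
    ehr_values = [float(v) for v in ehr_record.values()]
    return all(any(abs(s - v) <= tol for v in ehr_values) for s in stated)

# Hypothetical EHR record for one patient
ehr = {"heart_rate_bpm": 88, "serum_sodium_meq_l": 141}
print(numbers_match_ehr("Heart rate was 88 bpm; sodium measured 141 mEq/L.", ehr))  # True
print(numbers_match_ehr("Heart rate was 92 bpm.", ehr))  # False
```

Because the check runs against the EHR directly rather than through another LLM, a failure is interpretable: it pinpoints exactly which stated value has no support in the record.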

📝 Abstract
In healthcare, it is essential for any LLM-generated output to be reliable and accurate, particularly in cases involving decision-making and patient safety. However, LLM outputs are often unreliable in such critical settings due to the risk of hallucination. To address this issue, we propose a fact-checking module that operates independently of any LLM, along with a domain-specific summarization model designed to minimize hallucination rates. Our model is fine-tuned using Low-Rank Adaptation (LoRA) on the MIMIC-III dataset and is paired with the fact-checking module, which uses numerical tests for correctness and granular logical checks, via discrete logic in natural language processing (NLP), to validate facts against electronic health records (EHRs). We trained the summarization model on the full MIMIC-III dataset. To evaluate the fact-checking module, we sampled 104 summaries, decomposed them into 3,786 propositions, and used these as facts. The fact-checking module achieves a precision of 0.8904, a recall of 0.8234, and an F1-score of 0.8556. Additionally, the summarization model achieves a ROUGE-1 score of 0.5797 and a BERTScore of 0.9120 for summary quality.
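As a quick sanity check, the reported F1-score follows from the reported precision and recall, since F1 is their harmonic mean:

```python
# Precision and recall reported for the fact-checking module
precision, recall = 0.8904, 0.8234

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.8556, matching the reported F1-score
```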
Problem

Research questions and friction points this paper is trying to address.

Mitigating hallucinations in healthcare LLMs for reliability
Using granular fact-checking against EHRs to ensure accuracy
Employing domain-specific adaptation to reduce hallucination rates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Independent fact-checking module using numerical and logical tests
Domain-specific summarization model fine-tuned with LoRA on MIMIC-III
Granular validation against electronic health records to reduce hallucinations
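A minimal illustration of the granular logical check on a single proposition: a stated finding should match the EHR, and a negated finding should match its absence. The data structure and field names here are hypothetical, not the paper's actual implementation:

```python
def polarity_consistent(prop: dict, ehr_findings: set) -> bool:
    """A proposition asserting a finding should be present in the EHR;
    a negated proposition should correspond to its absence."""
    present = prop["finding"] in ehr_findings
    return present == (not prop["negated"])

# Hypothetical set of findings extracted from a patient's EHR
ehr_findings = {"pneumonia", "hypertension"}
print(polarity_consistent({"finding": "pneumonia", "negated": False}, ehr_findings))  # True
print(polarity_consistent({"finding": "sepsis", "negated": False}, ehr_findings))     # False
```

Running such checks per proposition, rather than over the whole summary, is what makes the validation interpretable: each failed proposition can be flagged individually.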
Musarrat Zeba
Applied Artificial Intelligence and INtelligent Systems (AAIINS) Laboratory, Dhaka 1217, Bangladesh; Department of Computer Science and Engineering, United International University, Dhaka 1212, Bangladesh
Abdullah Al Mamun
Applied Artificial Intelligence and INtelligent Systems (AAIINS) Laboratory, Dhaka 1217, Bangladesh; Department of Computer Science and Engineering, United International University, Dhaka 1212, Bangladesh
Kishoar Jahan Tithee
Applied Artificial Intelligence and INtelligent Systems (AAIINS) Laboratory, Dhaka 1217, Bangladesh; Department of Computer Science and Engineering, United International University, Dhaka 1212, Bangladesh
Debopom Sutradhar
Applied Artificial Intelligence and INtelligent Systems (AAIINS) Laboratory, Dhaka 1217, Bangladesh; Department of Computer Science and Engineering, United International University, Dhaka 1212, Bangladesh
Mohaimenul Azam Khan Raiaan
PhD Student, Monash University
Computer Vision, Explainable AI, Artificial Life, Large Language Model
Saddam Mukta
Post-doctoral Researcher, LUT University
#LUTsoftware, Artificial Intelligence, Machine Learning, NLP, Social Network Mining
Reem E. Mohamed
Faculty of Science and Information Technology, Charles Darwin University, Sydney, NSW, Australia
Md Rafiqul Islam
Faculty of Science and Technology, Charles Darwin University, Casuarina, NT 0909, Australia
Yakub Sebastian
Faculty of Science and Technology, Charles Darwin University, Casuarina, NT 0909, Australia
Mukhtar Hussain
Faculty of Science and Information Technology, Charles Darwin University, Sydney, NSW, Australia
Sami Azam
Faculty of Science and Technology, Charles Darwin University, Casuarina, NT 0909, Australia