🤖 AI Summary
This work addresses the reliability of open-source large language models (LLMs) in extracting critical information—such as admission diagnosis, significant in-hospital events, and follow-up recommendations—from clinical discharge summaries. We propose the first fine-grained hallucination classification and attribution evaluation framework specifically designed for clinical summarization. Our methodology integrates BERTScore/ROUGE metrics, customized event-matching rules, double-blind human annotation, adversarial prompting, and uncertainty calibration. Empirical evaluation reveals that mainstream open-source LLMs exhibit hallucination rates of 31–67%, predominantly manifesting as structural omissions, temporal misalignment, and fabricated diagnoses. To mitigate these issues, we introduce a lightweight post-processing module that operates without modifying the underlying model architecture. This intervention improves F1 score for key event extraction by 12.4%, substantially enhancing the clinical credibility and factual consistency of generated summaries.
📝 Abstract
Clinical summarization is crucial in healthcare as it distills complex medical data into digestible information, enhancing patient understanding and care management. Large language models (LLMs) have shown significant potential in automating and improving the accuracy of such summarizations due to their advanced natural language understanding capabilities. These models are particularly applicable in the context of summarizing medical/clinical texts, where precise and concise information transfer is essential. In this paper, we investigate the effectiveness of open-source LLMs in extracting key events from discharge reports, such as reasons for hospital admission, significant in-hospital events, and critical follow-up actions. In addition, we also assess the prevalence of various types of hallucinations in the summaries produced by these models. Detecting hallucinations is vital as it directly influences the reliability of the information, potentially affecting patient care and treatment outcomes. We conduct comprehensive numerical simulations to rigorously evaluate the performance of these models, further probing the accuracy and fidelity of the extracted content in clinical summarization.