Not What the Doctor Ordered: Surveying LLM-based De-identification and Quantifying Clinical Information Loss

📅 2025-09-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study identifies three critical issues in current LLM-driven medical text de-identification: (1) lack of standardized evaluation metrics; (2) inadequacy of conventional classification metrics (e.g., F1-score) in detecting clinically harmful over-redaction, such as erroneous removal of essential diagnostic or medication information; and (3) absence of human expert validation for novel error types. To address these, the authors propose a hybrid evaluation framework integrating automated error detection with clinical expert review, systematically benchmarking leading LLMs on standard medical de-identification datasets. They introduce a fine-grained, clinically grounded error annotation schema and quantitative metrics explicitly designed to assess clinical information preservation. Results reveal pervasive over-redaction across state-of-the-art methods, with conventional automatic metrics substantially overestimating performance. Clinical expert validation confirms severe limitations in real-world clinical utility. This work establishes a new evaluation paradigm for medical de-identification that rigorously balances privacy protection and clinical usability.

๐Ÿ“ Abstract
De-identification in the healthcare setting is an application of NLP in which automated algorithms remove personally identifying information about patients (and, sometimes, providers). With the recent rise of generative large language models (LLMs), there has been a corresponding rise in the number of papers applying LLMs to de-identification. Although these approaches often report near-perfect results, significant challenges concerning the reproducibility and utility of the research persist. This paper identifies three key limitations in the current literature: inconsistent reporting metrics that hinder direct comparisons; the inadequacy of traditional classification metrics for capturing errors to which LLMs may be more prone (i.e., altering clinically relevant information); and the lack of manual validation of the automated metrics that aim to quantify these errors. To address these issues, we first present a survey of LLM-based de-identification research, highlighting the heterogeneity in reporting standards. Second, we evaluate a diverse set of models to quantify the extent of inappropriate removal of clinical information. Next, we conduct a manual validation of an existing evaluation metric for measuring the removal of clinical information, employing clinical experts to assess its efficacy. We highlight poor performance and describe the inherent limitations of such metrics in identifying clinically significant changes. Lastly, we propose a novel methodology for detecting removal of clinically relevant information.
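The over-redaction problem the abstract describes can be illustrated with a minimal sketch: given an original note and its de-identified version, flag clinical terms that disappeared even though they are not PHI. The term lexicon, example texts, and token-level matching below are illustrative assumptions, not the paper's actual metric or methodology.

```python
# Illustrative sketch of quantifying over-redaction (NOT the paper's method).
# CLINICAL_TERMS is a toy lexicon standing in for a real clinical vocabulary
# (e.g., drug names and diagnoses); a real system would use ontology matching.
CLINICAL_TERMS = {"metformin", "hypertension", "insulin", "pneumonia"}

def over_redacted_terms(original: str, deidentified: str) -> set:
    """Return clinical terms present in the original note but absent
    after de-identification, i.e., candidate clinically harmful removals."""
    orig_tokens = set(original.lower().split())
    deid_tokens = set(deidentified.lower().split())
    return (orig_tokens - deid_tokens) & CLINICAL_TERMS

# Hypothetical example: the model redacted a medication along with the name.
original = "John Smith was started on metformin for type 2 diabetes"
redacted = "[NAME] was started on [REDACTED] for type 2 diabetes"
print(over_redacted_terms(original, redacted))  # {'metformin'}
```

As the paper argues, lexicon-based checks like this miss paraphrases and subtler alterations of clinical meaning, which is why expert validation is needed.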
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM-based de-identification's clinical information loss
Addressing inconsistent metrics and reproducibility challenges
Quantifying inappropriate removal of clinically relevant data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Surveyed LLM-based de-identification research heterogeneity
Conducted manual validation with clinical experts
Proposed novel methodology detecting clinical information removal
Kiana Aghakasiri
Department of Computing Science, University of Alberta
Noopur Zambare
Department of Computing Science, University of Alberta
JoAnn Thai
Department of Medicine, University of Alberta
Carrie Ye
Department of Medicine, University of Alberta
Mayur Mehta
Department of Medicine, University of Alberta
J. Ross Mitchell
Department of Computing Science, University of Alberta
Mohamed Abdalla
University of Alberta