Paradox of De-identification: A Critique of HIPAA Safe Harbour in the Age of LLMs

📅 2026-02-09
🤖 AI Summary
This study demonstrates that clinical text de-identified according to the HIPAA Safe Harbor standard remains vulnerable to privacy breaches: modern large language models (LLMs) can exploit implicit associations between diagnoses and quasi-identifiers to re-identify patients with high accuracy and even infer their residential communities. By integrating causal graphical modeling with LLMs, the authors conduct re-identification experiments and diagnostic ablation analyses to systematically uncover re-identification vulnerabilities in de-identified narratives. They introduce the “de-identification paradox”: the counterintuitive finding that stringent de-identification may inadvertently heighten privacy risks by preserving semantic structures that facilitate inference. This work challenges conventional privacy-preserving paradigms and reveals fundamental inadequacies in current de-identification standards in the era of advanced artificial intelligence.

📝 Abstract
Privacy is a human right that sustains patient-provider trust. Clinical notes capture a patient's private vulnerability and individuality, which are used for care coordination and research. Under HIPAA Safe Harbor, these notes are de-identified to protect patient privacy. However, Safe Harbor was designed for an era of categorical tabular data, focusing on the removal of explicit identifiers while ignoring the latent information found in correlations between identity and quasi-identifiers, which can be captured by modern LLMs. We first formalize these correlations using a causal graph, then validate it empirically through individual re-identification of patients from scrubbed notes. The paradox of de-identification is further shown through a diagnosis ablation: even when all other information is removed, the model can predict the patient's neighborhood based on diagnosis alone. This position paper raises the question of how we can act as a community to uphold patient-provider trust when de-identification is inherently imperfect. We aim to raise awareness and discuss actionable recommendations.
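To make the role of quasi-identifiers concrete, here is a minimal sketch, not the authors' method (they use a causal graph and LLM-based re-identification). It only counts how many records become unique once a diagnosis is combined with coarse demographics. The records, field names, and the `unique_fraction` helper are all hypothetical and invented for illustration.

```python
from collections import Counter

# Hypothetical toy records with explicit identifiers already removed.
# All values are invented for illustration; they are not from the paper.
records = [
    {"age_band": "30-39", "sex": "F", "diagnosis": "asthma"},
    {"age_band": "30-39", "sex": "F", "diagnosis": "asthma"},
    {"age_band": "30-39", "sex": "F", "diagnosis": "sickle cell disease"},
    {"age_band": "60-69", "sex": "M", "diagnosis": "type 2 diabetes"},
    {"age_band": "60-69", "sex": "M", "diagnosis": "type 2 diabetes"},
    {"age_band": "60-69", "sex": "M", "diagnosis": "tuberculosis"},
]


def unique_fraction(rows, keys):
    """Return the fraction of rows whose combination of `keys` occurs exactly once."""
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    singled_out = sum(1 for row in rows if combos[tuple(row[k] for k in keys)] == 1)
    return singled_out / len(rows)


# Demographics alone: every combination is shared by three records.
frac_demo = unique_fraction(records, ["age_band", "sex"])

# Adding the diagnosis singles out the two records with rarer conditions.
frac_with_dx = unique_fraction(records, ["age_band", "sex", "diagnosis"])

print(f"unique on demographics only: {frac_demo:.2f}")
print(f"unique with diagnosis added: {frac_with_dx:.2f}")
```

In this toy setting the diagnosis behaves like a quasi-identifier even though it is not on the Safe Harbor list of explicit identifiers, which is the kind of latent correlation the abstract describes.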
Problem

Research questions and friction points this paper is trying to address.

de-identification
HIPAA Safe Harbor
patient privacy
large language models
re-identification
Innovation

Methods, ideas, or system contributions that make the work stand out.

causal graph
re-identification
LLMs
de-identification
quasi-identifiers
Lavender Y. Jiang
New York University, New York, NY, USA
Xujin Chris Liu
New York University, New York, NY, USA
Kyunghyun Cho
New York University, Genentech
Machine Learning, Deep Learning
Eric K. Oermann
New York University, New York, NY, USA