Paradox of De-identification: A Critique of HIPAA Safe Harbour in the Age of LLMs

📅 2026-02-09

🤖 AI Summary

This study argues that clinical text de-identified under the HIPAA Safe Harbor standard remains vulnerable to privacy breaches: modern large language models (LLMs) can exploit implicit associations between diagnoses and quasi-identifiers to re-identify patients and even infer the neighborhood where a patient lives. The authors formalize these associations with a causal graph, then test them with LLM-based re-identification experiments on scrubbed notes and a diagnosis ablation. They call the result the “de-identification paradox”: even when everything except the diagnosis is removed, the clinical content that remains still carries enough signal to support inference about the patient. The work challenges conventional privacy-preserving practice and argues that current de-identification standards are inadequate in the era of advanced AI.

📝 Abstract

Privacy is a human right that sustains patient-provider trust. Clinical notes, which are used for care coordination and research, capture a patient's private vulnerability and individuality. Under HIPAA Safe Harbor, these notes are de-identified to protect patient privacy. However, Safe Harbor was designed for an era of categorical tabular data, focusing on the removal of explicit identifiers while ignoring the latent information found in correlations between identity and quasi-identifiers, which can be captured by modern LLMs. We first formalize these correlations using a causal graph, then validate the graph empirically through individual re-identification of patients from scrubbed notes. The paradox of de-identification is further shown through a diagnosis ablation: even when all other information is removed, the model can predict the patient's neighborhood based on diagnosis alone. This position paper raises the question of how we can act as a community to uphold patient-provider trust when de-identification is inherently imperfect. We aim to raise awareness and discuss actionable recommendations.
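
To make the diagnosis-ablation argument concrete, here is a minimal sketch of how a single retained attribute can shift the odds on a removed one. It is not the paper's method or data: the records are synthetic, the neighborhood and diagnosis labels are hypothetical, and a simple count-based Bayes estimate stands in for the LLM used in the paper.

```python
from collections import Counter

# Hypothetical synthetic records of (neighborhood, diagnosis).
# These counts are invented for illustration and are not from the paper.
records = (
    [("north", "asthma")] * 30 + [("north", "diabetes")] * 10
    + [("south", "asthma")] * 5 + [("south", "diabetes")] * 35
    + [("east", "asthma")] * 10 + [("east", "diabetes")] * 10
)


def prior(rows):
    """P(neighborhood): what an attacker knows with no note at all."""
    counts = Counter(neighborhood for neighborhood, _ in rows)
    total = sum(counts.values())
    return {n: c / total for n, c in counts.items()}


def posterior(rows, diagnosis):
    """P(neighborhood | diagnosis), estimated from co-occurrence counts."""
    counts = Counter(n for n, d in rows if d == diagnosis)
    total = sum(counts.values())
    return {n: c / total for n, c in counts.items()}


def best_guess(dist):
    """Most probable neighborhood under a distribution."""
    return max(dist, key=dist.get)


baseline = prior(records)
after_asthma = posterior(records, "asthma")

print("prior:", {n: round(p, 3) for n, p in baseline.items()})
print("given asthma:", {n: round(p, 3) for n, p in after_asthma.items()})
print("best guess given asthma:", best_guess(after_asthma))
```

In this toy table, knowing only the diagnosis "asthma" raises the probability of the "north" neighborhood from 0.40 to about 0.67, although no Safe Harbor identifier is present. The abstract's claim is that an LLM can pick up correlations of this kind directly from free text.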

Problem

Research questions and friction points this paper is trying to address.

de-identification

HIPAA Safe Harbor

patient privacy

large language models

re-identification

Innovation

Methods, ideas, or system contributions that make the work stand out.

causal graph

re-identification

LLMs

de-identification

quasi-identifiers
