Re-identification of De-identified Documents with Autoregressive Infilling

📅 2025-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study examines how robust de-identified documents are against re-identification attacks. To reconstruct personally identifiable information (PII) that has been masked in a text, we propose a two-stage method inspired by retrieval-augmented generation (RAG). First, a retriever such as DPR or ColBERT selects contextually relevant passages from an external knowledge base (e.g., Wikipedia, legal corpora, clinical texts). Second, an autoregressive language model (e.g., LLaMA, Phi) fills in the masks iteratively, conditioning each prediction on the retrieved passages and the surrounding context. To our knowledge, this is the first systematic application of RAG to reversing de-identification and the first to use knowledge-driven iterative inference for PII recovery. Evaluated on real-world biographical, judicial, and clinical datasets, the method recovers up to 80% of masked PII spans, and accuracy increases with the amount of background knowledge available, exposing security gaps in current de-identification practices.

📝 Abstract

Documents revealing sensitive information about individuals must typically be de-identified. This de-identification is often done by masking all mentions of personally identifiable information (PII), thereby making it more difficult to uncover the identity of the person(s) in question. To investigate the robustness of de-identification methods, we present a novel, RAG-inspired approach that attempts the reverse process of re-identification based on a database of documents representing background knowledge. Given a text in which personal identifiers have been masked, the re-identification proceeds in two steps. A retriever first selects from the background knowledge passages deemed relevant for the re-identification. Those passages are then provided to an infilling model which seeks to infer the original content of each text span. This process is repeated until all masked spans are replaced. We evaluate the re-identification on three datasets (Wikipedia biographies, court rulings and clinical notes). Results show that (1) as many as 80% of de-identified text spans can be successfully recovered and (2) the re-identification accuracy increases along with the level of background knowledge.
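The retrieve-then-infill loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `[MASK]` token, the function names, the token-overlap retriever, and the toy infiller are all hypothetical stand-ins. In the paper's setting, the retriever would be a trained retrieval model and the infiller an autoregressive language model.

```python
import re
from typing import Callable, List

MASK = "[MASK]"  # hypothetical mask token, chosen for illustration


def tokenize(text: str) -> set:
    """Lowercased alphanumeric tokens, used only by the toy retriever."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, knowledge: List[str], k: int = 2) -> List[str]:
    """Toy lexical retriever: rank passages by token overlap with the query.
    A stand-in for a dense or late-interaction retriever."""
    q = tokenize(query)
    ranked = sorted(knowledge, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]


def toy_infill(text: str, passages: List[str]) -> str:
    """Toy infiller: copy the word that follows the mask's left neighbour
    in a retrieved passage. A stand-in for a language model."""
    left = re.findall(r"\w+", text.split(MASK, 1)[0])
    if left:
        prev = left[-1]
        for passage in passages:
            words = re.findall(r"\w+", passage)
            for a, b in zip(words, words[1:]):
                if a == prev:
                    return b
    return "UNKNOWN"


def reidentify(
    text: str,
    knowledge: List[str],
    infill: Callable[[str, List[str]], str],
    k: int = 2,
    max_steps: int = 50,
) -> str:
    """Repeat retrieval and infilling until no masked span remains."""
    for _ in range(max_steps):
        if MASK not in text:
            break
        # Retrieve again at every step so earlier guesses can inform the query.
        passages = retrieve(text, knowledge, k)
        guess = infill(text, passages)
        text = text.replace(MASK, guess, 1)  # fill the leftmost mask only
    return text


if __name__ == "__main__":
    knowledge = [
        "The physicist Curie was born in Warsaw.",
        "Paris is the capital of France.",
    ]
    masked = "The physicist [MASK] was born in [MASK]."
    print(reidentify(masked, knowledge, toy_infill))
```

Re-running retrieval inside the loop is a design choice of this sketch: once one span has been filled, the partially restored text becomes a more specific query for the remaining spans.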

Problem

Research questions and friction points this paper is trying to address.

Assessing robustness of document de-identification methods

Re-identifying masked personal information using background knowledge

Evaluating recovery accuracy of de-identified text spans
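One simple way to quantify span recovery is the fraction of masked spans whose predicted content matches the original. The sketch below assumes a case- and whitespace-insensitive exact match; this matching criterion and the function name `span_recovery_accuracy` are assumptions made for illustration, and the paper may score recovery differently.

```python
from typing import List


def normalize(span: str) -> str:
    """Lowercase and collapse whitespace before comparing spans."""
    return " ".join(span.lower().split())


def span_recovery_accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of masked spans whose prediction matches the original span
    (assumed criterion: normalized exact match)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    hits = sum(normalize(g) == normalize(p) for g, p in zip(gold, predicted))
    return hits / len(gold)


if __name__ == "__main__":
    gold = ["Marie Curie", "Warsaw", "1867", "Paris", "Sorbonne"]
    predicted = ["marie curie", "Warsaw", "1868", "Paris", "sorbonne "]
    print(span_recovery_accuracy(gold, predicted))
```

In this made-up example, four of the five spans match after normalization, so the score is 4/5.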

Innovation

Methods, ideas, or system contributions that make the work stand out.

RAG-inspired approach for re-identification

Retriever selects relevant background passages

Autoregressive infilling model recovers masked spans
