🤖 AI Summary
To mitigate linkage attacks that exploit N-gram–based search over de-identified documents, this paper proposes a privacy-preserving defense that jointly optimizes semantic fidelity and anonymity. First, an N-gram inverted index identifies rare, potentially identifying phrases; then, guided by k-anonymity principles, a large language model (LLM) iteratively rewrites these sensitive segments with semantically equivalent alternatives. This work is the first to integrate inverted indexing with LLM-driven semantic rewriting for linkage-attack prevention, disrupting exact phrase-matching traceability at its source. Experiments on a real-world judicial case dataset show that the method reduces linkage success rates by 72.4% while preserving semantic accuracy (BLEU ≥ 0.89) and readability (human evaluation score: 4.6/5.0).
📝 Abstract
While de-identification models can help conceal the identity of the individual(s) mentioned in a document, they fail to address linkage risks, defined as the potential to map the de-identified text back to its source. One straightforward way to perform such linkages is to extract phrases from the de-identified document and check for their presence in the original dataset. This paper presents a method to counter such search-based linkage attacks while preserving the semantic integrity of the text. The method proceeds in two steps. We first construct an inverted index of the N-grams occurring in the document collection, making it possible to efficiently determine which N-grams appear in fewer than $k$ documents (either alone or in combination with other N-grams). An LLM-based rewriter is then iteratively queried to reformulate those spans until linkage is no longer possible. Experimental results on a collection of court cases show that the method effectively prevents search-based linkages while remaining faithful to the original content.
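The two-step pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `rewrite_fn` is a hypothetical stand-in for the LLM rewriter, and the rarity test `0 < count < k` (a span is risky only if it matches at least one but fewer than $k$ indexed documents) is an assumption about how the linkage criterion is operationalized.

```python
from collections import defaultdict

def ngrams(tokens, n=3):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_inverted_index(docs, n=3):
    """Map each n-gram to the set of document ids it occurs in."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for gram in ngrams(text.split(), n):
            index[gram].add(doc_id)
    return index

def k_rare_spans(tokens, index, n=3, k=2):
    """N-grams matching at least one but fewer than k indexed documents,
    i.e. spans an attacker could use to link the text back to its source."""
    return [g for g in ngrams(tokens, n) if 0 < len(index.get(g, ())) < k]

def anonymize(tokens, index, rewrite_fn, n=3, k=2, max_iters=5):
    """Iteratively rewrite linkable spans until none remain (or the
    iteration budget is exhausted). `rewrite_fn(tokens, rare_spans)` is a
    placeholder for the LLM-based rewriter queried at each round."""
    for _ in range(max_iters):
        rare = k_rare_spans(tokens, index, n, k)
        if not rare:
            break
        tokens = rewrite_fn(tokens, rare)
    return tokens
```

A span shared by many documents (e.g. a boilerplate legal phrase) is safe under this test, while a trigram unique to one court case is flagged for rewriting; the loop re-checks the index after each rewrite because a paraphrase can itself introduce a new rare span.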