🤖 AI Summary
This study addresses pervasive data leakage and data contamination in current evaluations of personally identifiable information (PII) anonymization techniques, flaws that lead to a significant overestimation of the reported attacks' success. We systematically analyze methodological weaknesses in existing evaluation practices and, for the first time, expose the contamination introduced when publicly available data is used as a stand-in for truly private data in experimental designs. Our critical evaluation, combined with an analysis of data leakage pathways and adversarial scenario modeling, suggests that most reported attack successes rest on unrealistic assumptions about the data. We argue that only evaluations grounded in genuinely private data can reliably assess the security of anonymization methods. This work thereby clarifies what a trustworthy, reproducible evaluation of privacy-preserving techniques would require, and why restricted access to truly private data makes it difficult for the public research community to provide one.
📝 Abstract
Removing personally identifiable information (PII) from text is necessary to comply with various data protection regulations and to enable data sharing without compromising privacy. However, recent work shows that documents sanitized by PII removal techniques are vulnerable to reconstruction attacks. Yet we suspect that the reported success of these attacks is largely overestimated. We critically analyze the evaluation of existing attacks and find that data leakage and data contamination are not properly mitigated, leaving open the question of whether PII removal techniques truly protect privacy in real-world scenarios. We investigate possible data sources and attack setups that avoid data leakage and conclude that only truly private data allows an objective evaluation of vulnerabilities in PII removal techniques. However, access to private data is, for good reason, heavily restricted, which also means that the public research community cannot address this problem in a transparent, reproducible, and trustworthy manner.
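To make the contamination concern concrete, here is a minimal sketch (not from the paper) of one common way such leakage is detected: flagging "private" evaluation documents that share long n-grams with a public corpus a model could have been trained on. All names here (`contamination_rate`, the toy corpus) are illustrative assumptions, and real contamination audits are considerably more involved.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams of a document."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(eval_docs, public_corpus, n: int = 8) -> float:
    """Fraction of evaluation documents sharing at least one long
    n-gram with the public corpus -- a crude leakage signal."""
    corpus_grams = set()
    for doc in public_corpus:
        corpus_grams |= ngrams(doc, n)
    flagged = sum(1 for doc in eval_docs if ngrams(doc, n) & corpus_grams)
    return flagged / max(len(eval_docs), 1)

# Toy example: the first "private" document is verbatim in the public corpus,
# so an attacker's model may simply have memorized it.
public = ["alice smith lives at 12 oak street and works at acme corp in springfield"]
evals = [
    "alice smith lives at 12 oak street and works at acme corp in springfield",
    "a completely unrelated note about gardening tools and weather",
]
print(contamination_rate(evals, public))  # 0.5: one of two documents overlaps
```

Under the paper's argument, any attack evaluated on documents that would be flagged by a check like this may owe its apparent success to memorization rather than to a genuine weakness of the PII removal technique.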