🤖 AI Summary
To balance privacy preservation and downstream utility in text anonymization under re-identification attacks enhanced by large language models (LLMs), this paper proposes a privacy-utility co-optimization framework. The method introduces the first LLM-driven tripartite evaluation paradigm, comprising a privacy evaluator, a utility evaluator, and a joint optimizer, and pioneers the use of direct preference optimization (DPO) to distill these anonymization capabilities into a lightweight, real-time model. By systematically modeling and mitigating the privacy-utility trade-off, the approach defends robustly against LLM-powered re-identification while preserving practical utility: compared to baselines, it reduces LLM-based re-identification success rates by 38.2% on average and decreases downstream task performance degradation by 22.7%. The framework supports scalable, real-time deployment, and code and datasets are publicly released.
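The co-optimization loop above can be sketched in miniature. This is a hypothetical, rule-based stand-in, not the paper's implementation: the real privacy evaluator, utility evaluator, and optimizer are LLM-based, and the identifier list, scoring functions, and `risk_budget` parameter here are illustrative assumptions only.

```python
# Minimal sketch of the tripartite anonymization loop (hypothetical stand-ins
# for the paper's LLM-based privacy evaluator, utility evaluator, and optimizer).
import re

TOY_IDENTIFIERS = ["Alice", "Berlin"]  # illustrative PII; an LLM attacker would infer these

def privacy_evaluator(text: str) -> float:
    """Re-identification risk proxy: fraction of known identifiers still present."""
    hits = sum(1 for ident in TOY_IDENTIFIERS if ident in text)
    return hits / len(TOY_IDENTIFIERS)

def utility_evaluator(text: str) -> float:
    """Utility proxy: fraction of task-relevant content words retained."""
    content = {"doctor", "treated", "patients", "flu"}
    words = set(re.findall(r"\w+", text.lower()))
    return len(content & words) / len(content)

def optimizer(text: str) -> str:
    """One anonymization step: mask the next remaining identifier."""
    for ident in TOY_IDENTIFIERS:
        if ident in text:
            return text.replace(ident, "[REDACTED]")
    return text

def anonymize(text: str, risk_budget: float = 0.0, max_steps: int = 5) -> str:
    """Iterate optimizer steps until the privacy evaluator's risk meets the budget."""
    for _ in range(max_steps):
        if privacy_evaluator(text) <= risk_budget:
            break
        text = optimizer(text)
    return text

out = anonymize("Alice, a doctor in Berlin, treated flu patients.")
# out == "[REDACTED], a doctor in [REDACTED], treated flu patients."
```

Note the division of labor: the privacy evaluator drives the stopping condition, while the utility evaluator can score candidate rewrites so the optimizer masks identifiers without discarding task-relevant content.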
📝 Abstract
Text anonymization is crucial for sharing sensitive data while maintaining privacy. Existing techniques face an emerging challenge: the re-identification capability of Large Language Models (LLMs), which have shown advanced ability to memorize detailed information and patterns and to connect disparate pieces of information. In defending against LLM-based re-identification attacks, anonymization can jeopardize the utility of the resulting data in downstream tasks; this trade-off between privacy and data utility requires deeper understanding in the context of LLMs. This paper proposes a framework composed of three LLM-based components, namely a privacy evaluator, a utility evaluator, and an optimization component, which work collaboratively to perform anonymization. To provide a practical model for large-scale and real-time environments, we distill the anonymization capabilities into a lightweight model using Direct Preference Optimization (DPO). Extensive experiments demonstrate that the proposed models outperform baselines, showing robustness in reducing the risk of re-identification while preserving greater data utility in downstream tasks. Our code and dataset are available at https://github.com/UKPLab/arxiv2024-rupta.
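The DPO distillation step mentioned above optimizes a standard preference objective. As a hedged illustration, the snippet below computes the per-pair DPO loss in plain Python; the `beta` value and log-probabilities are made-up numbers, and the paper's actual training setup (models, data, hyperparameters) is not reproduced here. Intuitively, a "chosen" sample would be a better-anonymized rewrite and a "rejected" sample a worse one.

```python
# Sketch of the DPO objective for one preference pair:
#   L = -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])
# where w = chosen (preferred anonymization) and l = rejected.
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Loss is low when the policy prefers the chosen output (relative to the
    frozen reference model) and high when it prefers the rejected one."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(sigmoid(beta * margin))

# Illustrative log-probabilities (assumed values):
low = dpo_loss(-5.0, -9.0, -7.0, -7.0)   # policy prefers chosen (margin +4) -> smaller loss
high = dpo_loss(-9.0, -5.0, -7.0, -7.0)  # policy prefers rejected (margin -4) -> larger loss
```

Minimizing this loss over many such pairs is what transfers the large evaluators' anonymization preferences into the lightweight student model.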