NAP^2: A Benchmark for Naturalness and Privacy-Preserving Text Rewriting by Learning from Human

📅 2024-06-06

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Sensitive information can leak when users send text to cloud-based large language model (LLM) services. Method: The paper proposes rewriting sensitive text before it reaches such services, aiming for both linguistic naturalness and privacy protection. To support research in this direction, the authors introduce NAP², which they describe as the first corpus for this task, built through crowdsourcing combined with LLM-generated rewrites. The corpus covers two strategies that people commonly use to sanitize text: deleting sensitive expressions, and abstracting sensitive details into more general wording. Results: The authors report that, compared with prior anonymization approaches, these human-inspired strategies produce more natural rewrites and a better balance between privacy protection and data utility.

📝 Abstract

The widespread use of cloud-based Large Language Models (LLMs) has heightened concerns over user privacy, as sensitive information may be inadvertently exposed during interactions with these services. To protect privacy before sending sensitive data to those models, we suggest sanitizing sensitive text using two common strategies used by humans: i) deleting sensitive expressions, and ii) obscuring sensitive details by abstracting them. To explore the issues and develop a tool for text rewriting, we curate the first corpus, coined NAP^2, through both crowdsourcing and the use of large language models (LLMs). Compared to the prior works on anonymization, the human-inspired approaches result in more natural rewrites and offer an improved balance between privacy protection and data utility, as demonstrated by our extensive experiments. Researchers interested in accessing the dataset are encouraged to contact the first or corresponding author via email.

Problem

Research questions and friction points this paper is trying to address.

Protecting user privacy in cloud-based LLM interactions

Developing natural text rewriting for sensitive data

Balancing privacy protection with data utility

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sanitizing text by deleting sensitive expressions

Obscuring details via human-inspired abstraction

Creating NAP^2 corpus via crowdsourcing and LLMs
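To make the two strategies concrete, here is a minimal Python sketch of deletion versus abstraction on one toy sentence. It is not the paper's method: the corpus was built with crowdsourcing and LLMs, whereas this sketch assumes the sensitive spans are already known and uses plain string replacement. The function names, the example sentence, and the `ABSTRACTIONS` lookup are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical lookup from a sensitive span to a more general phrase.
# In practice these rewrites would come from annotators or a model.
ABSTRACTIONS = {
    "at St. Mary's Hospital": "at a hospital",
    "in Melbourne": "in a large city",
}


def delete_sensitive(text, spans):
    """Strategy (i): drop each sensitive span, then tidy the spacing."""
    for span in spans:
        text = text.replace(span, "")
    text = re.sub(r"\s+", " ", text)            # collapse repeated whitespace
    text = re.sub(r"\s+([.,!?])", r"\1", text)  # no space before punctuation
    return text.strip()


def abstract_sensitive(text, mapping):
    """Strategy (ii): swap each sensitive span for a more general phrase."""
    for span, general in mapping.items():
        text = text.replace(span, general)
    return text


sentence = "I work as a nurse at St. Mary's Hospital in Melbourne."
deleted = delete_sensitive(sentence, list(ABSTRACTIONS))
abstracted = abstract_sensitive(sentence, ABSTRACTIONS)
print(deleted)     # I work as a nurse.
print(abstracted)  # I work as a nurse at a hospital in a large city.
```

In this toy case deletion removes the location entirely, while abstraction keeps the fact that a workplace and a city exist but hides which ones. That difference is one way to picture the balance between privacy protection and data utility that the paper discusses.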

🔎 Similar Papers

Large Language Models are Advanced Anonymizers