NAP^2: A Benchmark for Naturalness and Privacy-Preserving Text Rewriting by Learning from Human

📅 2024-06-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0

🤖 AI Summary
Sensitive information can leak when users send text to cloud-based large language model (LLM) services. Method: This paper proposes sanitizing text before it reaches the model, using a rewriting approach that aims for both linguistic naturalness and privacy protection. To support research in this direction, the authors introduce NAP^2, which they describe as the first corpus of its kind, built through crowdsourcing combined with LLM-generated rewrites. The rewrites follow two strategies that humans commonly use: deleting sensitive expressions and abstracting sensitive details. Results: According to the authors' experiments, these human-inspired rewrites read more naturally than the output of prior anonymization methods and offer a better balance between privacy protection and data utility.

📝 Abstract
The widespread use of cloud-based Large Language Models (LLMs) has heightened concerns over user privacy, as sensitive information may be inadvertently exposed during interactions with these services. To protect privacy before sending sensitive data to those models, we suggest sanitizing sensitive text using two strategies commonly used by humans: i) deleting sensitive expressions, and ii) obscuring sensitive details by abstracting them. To explore the issues and develop a tool for text rewriting, we curate the first corpus, coined NAP^2, through both crowdsourcing and the use of LLMs. Compared to prior work on anonymization, the human-inspired approaches result in more natural rewrites and offer an improved balance between privacy protection and data utility, as demonstrated by our extensive experiments. Researchers interested in accessing the dataset are encouraged to contact the first or corresponding author via email.
Problem

Research questions and friction points this paper is trying to address.

Protecting user privacy in cloud-based LLM interactions
Developing natural text rewriting for sensitive data
Balancing privacy protection with data utility
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sanitizing text by deleting sensitive expressions
Obscuring details via human-inspired abstraction
Creating NAP^2 corpus via crowdsourcing and LLMs
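
The two rewriting strategies listed above can be illustrated with a toy sketch. This is not the authors' method: the paper builds its corpus from crowdsourced and LLM-generated rewrites, whereas the code below only applies hand-written string substitutions. The `SensitiveSpan` class, both function names, the example sentence, and the chosen abstractions are all hypothetical and exist only to show what "deletion" and "abstraction" mean in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SensitiveSpan:
    """A sensitive expression plus a hand-written generalization (both hypothetical)."""
    text: str          # exact expression as it appears in the input
    abstraction: str   # less specific phrase used by the abstraction strategy


def rewrite_by_deletion(sentence: str, spans: list[SensitiveSpan]) -> str:
    """Strategy i: drop each sensitive expression, then collapse leftover whitespace."""
    for span in spans:
        sentence = sentence.replace(span.text, "")
    return " ".join(sentence.split())


def rewrite_by_abstraction(sentence: str, spans: list[SensitiveSpan]) -> str:
    """Strategy ii: swap each sensitive expression for a more general phrase."""
    for span in spans:
        sentence = sentence.replace(span.text, span.abstraction)
    return " ".join(sentence.split())


# Assumed example input; the spans are marked by hand, not detected automatically.
sentence = "I was treated for diabetes at a clinic in Melbourne last week."
spans = [
    SensitiveSpan(text="for diabetes", abstraction="for a health condition"),
    SensitiveSpan(text="in Melbourne", abstraction="in a large city"),
]

deleted = rewrite_by_deletion(sentence, spans)
abstracted = rewrite_by_abstraction(sentence, spans)

print(deleted)     # I was treated at a clinic last week.
print(abstracted)  # I was treated for a health condition at a clinic in a large city last week.
```

Deletion removes the most information, while abstraction keeps more of the sentence usable at the cost of revealing a coarse category. Plain substitution only produces fluent output when the marked spans happen to be removable phrases, which is one reason a corpus of human-written rewrites is useful.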