🤖 AI Summary
Text anonymization in high-sensitivity domains (e.g., healthcare, law) struggles to simultaneously ensure rigorous privacy protection and high-quality synthetic text generation.
Method: We propose a controllable synthetic text generation framework integrating de-identification with the “Hiding In Plain Sight” (HIPS) principle. Our approach introduces entity-aware control codes, leverages in-context learning (ICL) and prefix tuning, and employs a customized masking strategy alongside a privacy-aware loss function to explicitly model sensitive entity distributions during generation.
Contribution/Results: Evaluated on clinical and legal datasets, our method achieves strong k-anonymity and differential privacy guarantees while significantly improving the semantic fidelity, logical coherence, and downstream task utility of synthetic texts. It also demonstrates scalability and engineering feasibility, offering the first end-to-end solution for sensitive domains that jointly satisfies stringent privacy requirements and high-fidelity text generation.
📄 Abstract
Text anonymization is essential for responsibly developing and deploying AI in high-stakes domains such as healthcare, social services, and law. In this work, we propose a novel methodology for privacy-preserving synthetic text generation that leverages the principles of de-identification and the Hiding In Plain Sight (HIPS) theory. Our approach introduces entity-aware control codes to guide controllable generation using either in-context learning (ICL) or prefix tuning. The ICL variant ensures privacy levels consistent with the underlying de-identification system, while the prefix tuning variant incorporates a custom masking strategy and loss function to support scalable, high-quality generation. Experiments on legal and clinical datasets demonstrate that our method achieves a strong balance between privacy protection and utility, offering a practical and effective solution for synthetic text generation in sensitive domains.
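To make the entity-aware control-code idea concrete, the sketch below shows one plausible way a de-identified document could be rewritten with typed, numbered control codes and then packed into a few-shot ICL prompt. This is a minimal illustration under assumptions: the function names (`make_control_coded_text`, `build_icl_prompt`), the `<TYPE_n>` code format, and the prompt layout are hypothetical, not taken from the paper, and a real pipeline would consume offsets from an upstream de-identification system rather than plain string replacement.

```python
def make_control_coded_text(text, entities):
    """Replace each detected sensitive entity with a typed control code.

    entities: list of (surface_form, entity_type) pairs produced by an
    upstream de-identification step, e.g. [("John Smith", "PERSON")].
    Codes are numbered per type so repeated mentions stay distinguishable.
    (Illustrative sketch only; the paper's actual code format may differ.)
    """
    counters = {}
    coded = text
    for surface, etype in entities:
        counters[etype] = counters.get(etype, 0) + 1
        code = f"<{etype}_{counters[etype]}>"
        coded = coded.replace(surface, code)
    return coded


def build_icl_prompt(examples, query):
    """Assemble a few-shot (in-context learning) prompt from
    (control-coded input, synthetic output) example pairs."""
    parts = [f"Input: {src}\nOutput: {tgt}" for src, tgt in examples]
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


coded = make_control_coded_text(
    "John Smith visited Boston General on 2021-03-04.",
    [("John Smith", "PERSON"),
     ("Boston General", "HOSPITAL"),
     ("2021-03-04", "DATE")],
)
# coded == "<PERSON_1> visited <HOSPITAL_1> on <DATE_1>."
```

In the HIPS setting, the generator would then fill these coded slots with realistic surrogate entities, so that any residual true identifiers are hidden among plausible synthetic ones.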