Controlled Generation for Private Synthetic Text

πŸ“… 2025-09-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Text anonymization in high-sensitivity domains (e.g., healthcare, law) struggles to simultaneously ensure rigorous privacy protection and high-quality synthetic text generation. Method: We propose a controllable synthetic text generation framework integrating de-identification with the β€œHiding In Plain Sight” (HIPS) principle. Our approach introduces entity-aware control codes, leverages in-context learning (ICL) and prefix tuning, and employs a customized masking strategy alongside a privacy-aware loss function to explicitly model sensitive entity distributions during generation. Contribution/Results: Evaluated on clinical and legal datasets, our method achieves strong k-anonymity and differential privacy guarantees while significantly improving semantic fidelity, logical coherence, and downstream task utility of synthetic texts. It demonstrates scalability and engineering feasibility, offering the first end-to-end solution for sensitive domains that jointly satisfies stringent privacy requirements and high-fidelity text generation.

Technology Category

Application Category

πŸ“ Abstract
Text anonymization is essential for responsibly developing and deploying AI in high-stakes domains such as healthcare, social services, and law. In this work, we propose a novel methodology for privacy-preserving synthetic text generation that leverages the principles of de-identification and the Hiding In Plain Sight (HIPS) theory. Our approach introduces entity-aware control codes to guide controllable generation using either in-context learning (ICL) or prefix tuning. The ICL variant ensures privacy levels consistent with the underlying de-identification system, while the prefix tuning variant incorporates a custom masking strategy and loss function to support scalable, high-quality generation. Experiments on legal and clinical datasets demonstrate that our method achieves a strong balance between privacy protection and utility, offering a practical and effective solution for synthetic text generation in sensitive domains.
Problem

Research questions and friction points this paper is trying to address.

Developing privacy-preserving synthetic text generation for sensitive domains
Balancing privacy protection with utility in AI-generated text
Creating controlled generation methods using entity-aware codes and masking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Entity-aware control codes guide controllable text generation
In-context learning ensures privacy via de-identification system
Prefix tuning variant uses custom masking and loss function
πŸ”Ž Similar Papers
No similar papers found.