🤖 AI Summary
Text anonymization in high-sensitivity domains (e.g., healthcare, law) struggles to simultaneously ensure rigorous privacy protection and high-quality synthetic text generation.
Method: We propose a controllable synthetic text generation framework integrating de-identification with the “Hiding In Plain Sight” (HIPS) principle. Our approach introduces entity-aware control codes, leverages in-context learning (ICL) and prefix tuning, and employs a customized masking strategy alongside a privacy-aware loss function to explicitly model sensitive entity distributions during generation.
Contribution/Results: Evaluated on clinical and legal datasets, our method achieves strong k-anonymity and differential privacy guarantees while significantly improving the semantic fidelity, logical coherence, and downstream task utility of synthetic texts. It also demonstrates scalability and engineering feasibility, offering the first end-to-end solution for sensitive domains that jointly satisfies stringent privacy requirements and high-fidelity text generation.
📄 Abstract
Text anonymization is essential for responsibly developing and deploying AI in high-stakes domains such as healthcare, social services, and law. In this work, we propose a novel methodology for privacy-preserving synthetic text generation that leverages the principles of de-identification and the Hiding In Plain Sight (HIPS) theory. Our approach introduces entity-aware control codes to guide controllable generation using either in-context learning (ICL) or prefix tuning. The ICL variant ensures privacy levels consistent with the underlying de-identification system, while the prefix tuning variant incorporates a custom masking strategy and loss function to support scalable, high-quality generation. Experiments on legal and clinical datasets demonstrate that our method achieves a strong balance between privacy protection and utility, offering a practical and effective solution for synthetic text generation in sensitive domains.
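To make the entity-aware control-code idea concrete, the sketch below shows one plausible way a de-identified document could be rewritten with typed, numbered control codes and then packed into a few-shot ICL prompt. This is a minimal illustration under assumptions: the function names (`make_control_coded_text`, `build_icl_prompt`), the `<TYPE_n>` code format, and the prompt layout are hypothetical, not taken from the paper, and a real pipeline would consume offsets from an upstream de-identification system rather than plain string replacement.

```python
def make_control_coded_text(text, entities):
    """Replace each detected sensitive entity with a typed control code.

    entities: list of (surface_form, entity_type) pairs produced by an
    upstream de-identification step, e.g. [("John Smith", "PERSON")].
    Codes are numbered per type so repeated mentions stay distinguishable.
    (Illustrative sketch only; the paper's actual code format may differ.)
    """
    counters = {}
    coded = text
    for surface, etype in entities:
        counters[etype] = counters.get(etype, 0) + 1
        code = f"<{etype}_{counters[etype]}>"
        coded = coded.replace(surface, code)
    return coded


def build_icl_prompt(examples, query):
    """Assemble a few-shot (in-context learning) prompt from
    (control-coded input, synthetic output) example pairs."""
    parts = [f"Input: {src}\nOutput: {tgt}" for src, tgt in examples]
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


coded = make_control_coded_text(
    "John Smith visited Boston General on 2021-03-04.",
    [("John Smith", "PERSON"),
     ("Boston General", "HOSPITAL"),
     ("2021-03-04", "DATE")],
)
# coded == "<PERSON_1> visited <HOSPITAL_1> on <DATE_1>."
```

In the HIPS setting, the generator would then fill these coded slots with realistic surrogate entities, so that any residual true identifiers are hidden among plausible synthetic ones.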