Private Text Generation by Seeding Large Language Model Prompts

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address privacy risks in large language model (LLM)-generated synthetic data for sensitive domains such as healthcare, this paper proposes Differentially Private Keyphrase Prompt Seeding (DP-KPS). DP-KPS requires neither fine-tuning nor sharing the original sensitive data with the LLM; instead, it seeds input prompts with differentially private keyphrases, combining phrase-level embedding sampling with zero-shot prompt engineering to generate high-fidelity, diverse synthetic text from black-box LLMs. Notably, its rigorous differential privacy guarantee is achieved solely through the privatized prompts. Evaluated on several text classification tasks, synthetic corpora produced by DP-KPS retain 85–92% of the downstream model performance achieved on the original data, substantially outperforming existing privacy-preserving synthetic-data approaches. The framework thus combines formal privacy guarantees, simple deployment (no model modification), and high practical utility.

📝 Abstract
We explore how private synthetic text can be generated by suitably prompting a large language model (LLM). This addresses a challenge for organizations like hospitals, which hold sensitive text data like patient medical records, and wish to share it in order to train machine learning models for medical tasks, while preserving patient privacy. Methods that rely on training or finetuning a model may be out of reach, either due to API limits of third-party LLMs, or due to ethical and legal prohibitions on sharing the private data with the LLM itself. We propose Differentially Private Keyphrase Prompt Seeding (DP-KPS), a method that generates a private synthetic text corpus from a sensitive input corpus, by accessing an LLM only through privatized prompts. It is based on seeding the prompts with private samples from a distribution over phrase embeddings, thus capturing the input corpus while achieving requisite output diversity and maintaining differential privacy. We evaluate DP-KPS on downstream ML text classification tasks, and show that the corpora it generates preserve much of the predictive power of the original ones. Our findings offer hope that institutions can reap ML insights by privately sharing data with simple prompts and little compute.
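The abstract describes seeding LLM prompts with private samples drawn from a distribution over phrase embeddings. The paper's exact sampling mechanism is not detailed here, so the following is only a minimal illustrative sketch: it substitutes the exponential mechanism over keyphrase counts for the paper's embedding-space sampling, and the function names, prompt template, and budget-splitting choice are all hypothetical.

```python
import math
import random
from collections import Counter

def dp_sample_keyphrases(corpus_phrases, k, epsilon, candidate_vocab):
    """Sample k seed keyphrases under epsilon-differential privacy using
    the exponential mechanism over phrase counts (sensitivity 1, assuming
    each record contributes each phrase at most once)."""
    counts = Counter(corpus_phrases)
    eps_per_draw = epsilon / k  # naive sequential-composition budget split
    seeds = []
    for _ in range(k):
        # Exponential mechanism: weight each candidate by exp(eps * score / 2).
        weights = [math.exp(eps_per_draw * counts[p] / 2.0)
                   for p in candidate_vocab]
        seeds.append(random.choices(candidate_vocab, weights=weights, k=1)[0])
    return seeds

def build_seeded_prompt(keyphrases, task="a synthetic patient note"):
    """Seed a zero-shot prompt for a black-box LLM with the private keyphrases."""
    joined = ", ".join(keyphrases)
    return (f"Write {task} that naturally discusses the following topics: "
            f"{joined}. Do not reproduce any real record.")
```

Because the LLM sees only the privatized keyphrases, the generated corpus inherits the differential privacy guarantee by post-processing; no fine-tuning or raw-data access is needed, matching the black-box setting the abstract describes.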
Problem

Research questions and friction points this paper is trying to address.

Generate private synthetic text
Preserve patient privacy
Use privatized LLM prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Differentially Private Keyphrase Prompt Seeding (DP-KPS)
Seeds prompts with private samples from a distribution over phrase embeddings
Accesses the LLM only through privatized prompts, preserving differential privacy