Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs

📅 2025-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Generating high-fidelity private text under stringent computational constraints while satisfying rigorous differential privacy (DP) remains challenging, especially when leveraging large language models (LLMs). Method: We propose CTCL—a lightweight, DP-compliant text synthesis framework that avoids fine-tuning billion-scale LLMs and eliminates handcrafted prompts. CTCL introduces a novel 140M-parameter conditional generator jointly optimized with a DP-aware clustering-based topic model, decoupling fine-grained semantic modeling from coarse-grained distribution learning. It adopts a two-stage paradigm: public-data pretraining followed by private-data adaptation, augmented with DP histogram estimation. Results: Evaluated across five domains, CTCL achieves state-of-the-art utility under ε ≤ 2, significantly outperforming prompt-based and DP-fine-tuning baselines. Ablation studies confirm the indispensability of each component. CTCL delivers strong privacy guarantees (ε-DP), high computational efficiency, and scalability—enabling practical private text generation without sacrificing fidelity.

📝 Abstract
Synthetic data offers a promising path to train models while preserving data privacy. Differentially private (DP) finetuning of large language models (LLMs) as data generators is effective, but it is impractical when computational resources are limited. Meanwhile, prompt-based methods such as private evolution depend heavily on manually crafted prompts and use private information ineffectively in their iterative data selection process. To overcome these limitations, we propose CTCL (Data Synthesis with ConTrollability and CLustering), a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale LLM finetuning. CTCL pretrains a lightweight 140M-parameter conditional generator and a clustering-based topic model on large-scale public data. To adapt to the private domain, the generator is DP-finetuned on private data to capture fine-grained textual information, while the topic model extracts a DP histogram representing distributional information. The DP generator then samples according to the DP histogram to synthesize the desired number of data examples. Evaluation across five diverse domains demonstrates the effectiveness of our framework, particularly in the strong-privacy regime. Systematic ablations validate the design of each framework component and highlight the scalability of our approach.
Problem

Research questions and friction points this paper is trying to address.

Generate privacy-preserving synthetic text data efficiently
Avoid extensive prompt engineering and billion-scale LLM finetuning
Ensure strong privacy while maintaining data utility
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight 140M-parameter conditional generator pretrained on large-scale public data
DP finetuning on private data captures fine-grained textual information
Clustering-based topic model extracts a DP histogram of the private topic distribution
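The DP-histogram step in the last bullet can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: `dp_topic_histogram` and `sample_generation_plan` are hypothetical names, and a plain Laplace mechanism with a single epsilon budget stands in for whatever privacy accounting CTCL actually uses. Each private document is assigned to one topic cluster, so the histogram has L1 sensitivity 1 and Laplace noise of scale 1/ε suffices for ε-DP.

```python
import numpy as np

def dp_topic_histogram(topic_assignments, num_topics, epsilon, rng=None):
    """Estimate a differentially private histogram over topic clusters.

    Each private document contributes exactly one count, so the L1
    sensitivity is 1 and Laplace noise with scale 1/epsilon yields
    epsilon-DP (standard Laplace mechanism; the paper's exact
    accounting may differ).
    """
    rng = np.random.default_rng(rng)
    counts = np.bincount(topic_assignments, minlength=num_topics).astype(float)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=num_topics)
    noisy = np.clip(noisy, 0.0, None)  # negative counts are meaningless
    total = noisy.sum()
    if total == 0:
        return np.full(num_topics, 1.0 / num_topics)  # fall back to uniform
    return noisy / total

def sample_generation_plan(dp_hist, num_synthetic, rng=None):
    """Allocate how many synthetic examples to generate per topic,
    drawn proportionally to the DP histogram."""
    rng = np.random.default_rng(rng)
    return rng.multinomial(num_synthetic, dp_hist)
```

In the full pipeline, the per-topic allocation returned by `sample_generation_plan` would drive the DP-finetuned conditional generator: each topic's keywords serve as the generation condition, so the synthetic corpus matches the private topic distribution without spending extra privacy budget at sampling time.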