DP-2Stage: Adapting Language Models as Differentially Private Tabular Data Generators

📅 2024-12-03
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
To address challenges in differentially private (DP) tabular data generation with large language models (LLMs) -- textual incoherence, inefficient structural modeling, and suboptimal privacy-budget allocation -- this paper proposes a two-stage DP fine-tuning framework. In Stage I, non-private pre-adaptation on synthetic (pseudo) data decouples table-structure modeling from content generation. In Stage II, DP-SGD is applied exclusively to the real private data, so the privacy budget is spent only on content refinement. The method combines GPT-2-scale LLMs, table-to-text serialization, and transfer learning. Evaluated across multiple benchmark datasets, the synthesized data achieves higher statistical fidelity and better downstream machine-learning utility than end-to-end DP fine-tuning baselines under the same formal privacy guarantees. The implementation is publicly available.
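The summary mentions table-to-text serialization as the encoding that lets an LLM consume tabular records. A minimal sketch of that idea (hypothetical code, not the paper's released implementation; the `serialize_row` helper and the "column is value" template are assumptions based on common practice in LLM tabular generators):

```python
# Minimal table-to-text serialization sketch: each tabular record is
# flattened into a "column is value" string, so a GPT-2-style language
# model can be fine-tuned on rows as ordinary text sequences.

def serialize_row(row: dict) -> str:
    """Turn one tabular record into a flat text sequence."""
    return ", ".join(f"{col} is {val}" for col, val in row.items())

row = {"age": 39, "workclass": "Private", "income": "<=50K"}
print(serialize_row(row))
# -> age is 39, workclass is Private, income is <=50K
```

Because the column names and separators recur identically in every serialized row, they carry no private information, which is why Stage I can teach the model this structure on pseudo data without spending any privacy budget.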

📝 Abstract
Generating tabular data under differential privacy (DP) protection ensures theoretical privacy guarantees but poses challenges for training machine learning models, primarily due to the need to capture complex structures under noisy supervision signals. Recently, pre-trained Large Language Models (LLMs) -- even those at the scale of GPT-2 -- have demonstrated great potential in synthesizing tabular data. However, their applications under DP constraints remain largely unexplored. In this work, we address this gap by applying DP techniques to the generation of synthetic tabular data. Our findings show that LLMs face difficulties in generating coherent text when fine-tuned with DP, as privacy budgets are inefficiently allocated to non-private elements like table structures. To overcome this, we propose DP-2Stage, a two-stage fine-tuning framework for differentially private tabular data generation. The first stage involves non-private fine-tuning on a pseudo dataset, followed by DP fine-tuning on a private dataset. Our empirical results show that this approach improves performance across various settings and metrics compared to directly fine-tuned LLMs in DP contexts. We release our code and setup at https://github.com/tejuafonja/DP-2Stage.
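The abstract's second stage relies on DP-SGD: per-example gradient clipping plus calibrated Gaussian noise. A toy NumPy sketch of a single DP-SGD update (the `dp_sgd_update` function and its parameter names are illustrative assumptions, not the paper's code; a real run would use a DP library such as Opacus on the LLM's parameters):

```python
import numpy as np

def dp_sgd_update(params, per_example_grads, clip_norm, noise_mult, lr, rng):
    """One DP-SGD step: clip each per-example gradient to L2 norm
    <= clip_norm, sum, add Gaussian noise scaled to that sensitivity,
    then take an averaged gradient step."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    grad_sum = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip_norm, size=grad_sum.shape)
    noisy_mean = (grad_sum + noise) / len(per_example_grads)
    return params - lr * noisy_mean

# Illustrative usage on a 3-parameter "model":
rng = np.random.default_rng(42)
params = np.zeros(3)
grads = [np.array([3.0, 4.0, 0.0]), np.array([0.0, 0.0, 1.0])]
new_params = dp_sgd_update(params, grads, clip_norm=1.0,
                           noise_mult=1.1, lr=0.1, rng=rng)
```

The noise degrades every coordinate of the update equally, which is the friction the paper targets: in end-to-end DP fine-tuning, part of this noisy budget is wasted learning fixed, non-private table structure that Stage I could have learned for free.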
Problem

Research questions and friction points this paper is trying to address.

Generating DP-protected tabular data with LLMs
Inefficient privacy budget allocation in DP fine-tuning
Improving DP tabular data generation via two-stage fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage fine-tuning for DP tabular data
Non-private then DP fine-tuning stages
Improves LLM performance under DP constraints