Is API Access to LLMs Useful for Generating Private Synthetic Tabular Data?

📅 2025-02-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates the feasibility and efficacy of generating differentially private (DP) synthetic tabular data solely via API calls to large language models (LLMs), without access to model weights. To address this black-box setting, we propose two novel DP synthesis algorithms: (1) the first adaptation of the Private Evolution framework to tabular domains, incorporating a query-workload-aware distance metric; and (2) a single-API-call paradigm that eliminates adaptive query overhead. Experimental evaluation across multiple benchmarks shows that current LLM API–based approaches do not consistently outperform classical DP mechanisms, exposing inherent limitations of LLMs in structured data generation. Our contributions include a new paradigm for integrating DP synthetic data generation with LLM APIs, a reproducible baseline implementation, and concrete directions for future optimization—advancing both DP methodology and responsible LLM deployment for sensitive data synthesis.

📝 Abstract

Differentially private (DP) synthetic data is a versatile tool for enabling the analysis of private data. Recent advancements in large language models (LLMs) have inspired a number of algorithmic techniques for improving DP synthetic data generation. One family of approaches applies DP finetuning to the foundation model weights; however, the weights of state-of-the-art models may not be public. In this work we propose two DP synthetic tabular data algorithms that require only API access to the foundation model. First, we adapt the Private Evolution algorithm (Lin et al., 2023; Xie et al., 2024) -- which was designed for image and text data -- to the tabular data domain. In our extension of Private Evolution, we define a query workload-based distance measure, which may be of independent interest. Second, we propose a family of algorithms that use one-shot API access to LLMs, rather than adaptive queries to the LLM. Our findings reveal that API access to powerful LLMs does not always improve the quality of DP synthetic data compared to established baselines that operate without such access. We provide insights into the underlying reasons and propose improvements to LLMs that could make them more effective for this application.
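To make the Private Evolution idea more concrete, the sketch below illustrates one voting step under stated assumptions; it is not the paper's implementation. In Private Evolution, each private record votes for its nearest candidate synthetic record, and noise is added to the vote counts before the candidates are resampled. The abstract does not specify the form of the workload-based distance, so the L1 distance between per-record query answers used here is an assumption, as are the names `workload_answers`, `workload_distance`, and `noisy_vote_histogram`, the dict-based record format, and the representation of a query as a tuple of (column, value) pairs. The noise scale `sigma` would have to be calibrated to a target (ε, δ) by a privacy accountant, which is not shown.

```python
import random


def workload_answers(record, workload):
    """Answer every workload query on one record: 1.0 if it matches, else 0.0.

    Each query is a tuple of (column, value) pairs that must all hold
    (a hypothetical encoding of marginal-style counting queries).
    """
    return [
        1.0 if all(record[col] == val for col, val in query) else 0.0
        for query in workload
    ]


def workload_distance(a, b, workload):
    """L1 distance between two records' query-answer vectors (assumed form)."""
    qa = workload_answers(a, workload)
    qb = workload_answers(b, workload)
    return sum(abs(x - y) for x, y in zip(qa, qb))


def noisy_vote_histogram(private_rows, candidates, workload, sigma, rng):
    """Each private row votes for its nearest candidate; Gaussian noise is added.

    Every row contributes exactly one vote, so adding or removing a row
    changes the histogram by at most 1 in L2 norm.
    """
    votes = [0.0] * len(candidates)
    for row in private_rows:
        nearest = min(
            range(len(candidates)),
            key=lambda i: workload_distance(row, candidates[i], workload),
        )
        votes[nearest] += 1.0
    return [v + rng.gauss(0.0, sigma) for v in votes]


if __name__ == "__main__":
    # Toy workload: two 1-way queries and one 2-way query.
    workload = [
        (("age", "young"),),
        (("sex", "F"),),
        (("age", "young"), ("sex", "F")),
    ]
    private_rows = [
        {"age": "young", "sex": "F"},
        {"age": "young", "sex": "F"},
        {"age": "old", "sex": "M"},
    ]
    # In the API-only setting, candidates would come from LLM API calls.
    candidates = [{"age": "young", "sex": "F"}, {"age": "old", "sex": "M"}]
    rng = random.Random(0)
    print(noisy_vote_histogram(private_rows, candidates, workload, 1.0, rng))
```

Candidates with high noisy counts would then be kept and passed back to the LLM API to produce variations for the next round. Measuring distance in query-answer space rather than raw feature space ties the vote to the statistics the workload cares about, which is one plausible reading of why a workload-based measure suits tabular data.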

Problem

Research questions and friction points this paper is trying to address.

API access to LLMs

Differentially private synthetic data

Tabular data generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

API-access algorithms for DP data

Private Evolution adapted for tabular

One-shot API access to LLMs

🔎 Similar Papers

A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models