Privacy Amplification Through Synthetic Data: Insights from Linear Regression

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the theoretical mechanisms by which synthetic data can amplify differential privacy (DP) guarantees: specifically, whether and how the privacy protection of a trained model improves when, instead of the model itself, only synthetic data produced by a *hidden* generative model is released. Method: Focusing on linear regression, the authors analyze privacy amplification under two generation paradigms: (i) synthetic data generated from random inputs, and (ii) deterministic, seed-controlled generation of a single synthetic point. The analysis combines DP theory with tight bounds on the privacy leakage of linear regression outputs. Contribution/Results: The paper formally proves that a finite synthetic dataset generated from random inputs strictly amplifies privacy beyond the model's inherent DP guarantee, whereas a single seed-controlled synthetic point can leak as much information as releasing the model itself, i.e., complete privacy collapse. These dual boundaries, positive amplification and negative collapse, are established for linear regression and are intended as a foundation for deriving bounds for more general generative models.
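The contrast between the two paradigms can be illustrated with a minimal pure-Python sketch for scalar linear regression. This is not the paper's construction: the Gaussian output perturbation standing in for DP training, the noise scales, and all variable names are illustrative assumptions.

```python
import random

random.seed(0)

# Toy private dataset: y = w_true * x + noise (scalar regression).
w_true = 2.0
xs = [random.gauss(0.0, 1.0) for _ in range(200)]
ys = [w_true * x + random.gauss(0.0, 0.1) for x in xs]

# Least-squares slope through the origin, plus Gaussian output
# perturbation standing in for a DP training mechanism (the noise
# scale is illustrative, not calibrated to a formal (eps, delta)).
w_hat = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
w_dp = w_hat + random.gauss(0.0, 0.05)

# Paradigm (i): a synthetic point from a RANDOM input drawn by the
# data holder. Each released pair (x_syn, y_syn) is only a random
# probe of the private slope w_dp.
x_syn = random.gauss(0.0, 1.0)
y_syn = w_dp * x_syn

# Paradigm (ii): a SEED-CONTROLLED input chosen by the adversary.
# Querying x = 1 returns w_dp exactly, so releasing one synthetic
# point leaks as much as releasing the model itself.
y_adv = w_dp * 1.0
assert y_adv == w_dp
```

The sketch shows why hiding the generator matters only in paradigm (i): when the adversary picks the input, the synthetic output is a deterministic function of the model parameters and no amplification is possible.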

📝 Abstract
Synthetic data inherits the differential privacy guarantees of the model used to generate it. Additionally, synthetic data may benefit from privacy amplification when the generative model is kept hidden. While empirical studies suggest this phenomenon, a rigorous theoretical understanding is still lacking. In this paper, we investigate this question through the well-understood framework of linear regression. First, we establish negative results showing that if an adversary controls the seed of the generative model, a single synthetic data point can leak as much information as releasing the model itself. Conversely, we show that when synthetic data is generated from random inputs, releasing a limited number of synthetic data points amplifies privacy beyond the model's inherent guarantees. We believe our findings in linear regression can serve as a foundation for deriving more general bounds in the future.
Problem

Research questions and friction points this paper is trying to address.

Investigates whether releasing synthetic data amplifies privacy beyond the generative model's DP guarantee
Examines information leakage when an adversary controls the generative model's seed
Shows that releasing a limited number of synthetic points generated from random inputs enhances privacy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic data inherits the DP guarantees of the model that generates it
Privacy amplification when the generative model is kept hidden
Linear regression as a tractable framework for rigorous analysis