The Inadequacy of Similarity-based Privacy Metrics: Privacy Attacks against "Truly Anonymous" Synthetic Datasets

📅 2023-12-08
📈 Citations: 6
Influential: 0
🤖 AI Summary
This work exposes a fundamental flaw in statistical-similarity-based privacy metrics for synthetic data: datasets that pass these privacy tests remain vulnerable to membership inference, attribute inference, and reconstruction attacks, leading to severe data leakage. To demonstrate this, the authors introduce ReconSyn, a black-box reconstruction attack that recovers outlier training records using only a single fitted generative model and the privacy metrics themselves. The analysis further shows that applying Differential Privacy to the model alone does not help, because evaluating similarity-based metrics against the real data breaks the end-to-end DP pipeline. Experiments across multiple state-of-the-art synthetic data generators validate the attack, which reconstructs 78%–100% of the outliers in the training data. These results establish that statistical-similarity metrics are unreliable as a basis for privacy assurance.
📝 Abstract
Generative models producing synthetic data are meant to provide a privacy-friendly approach to releasing data. However, their privacy guarantees are only considered robust when models satisfy Differential Privacy (DP). Alas, this is not a ubiquitous standard, as many leading companies (and, in fact, research papers) use ad-hoc privacy metrics based on testing the statistical similarity between synthetic and real data. In this paper, we examine the privacy metrics used in real-world synthetic data deployments and demonstrate their unreliability in several ways. First, we provide counter-examples where severe privacy violations occur even if the privacy tests pass and instantiate accurate membership and attribute inference attacks with minimal cost. We then introduce ReconSyn, a reconstruction attack that generates multiple synthetic datasets that are considered private by the metrics but actually leak information unique to individual records. We show that ReconSyn recovers 78-100% of the outliers in the train data with only black-box access to a single fitted generative model and the privacy metrics. In the process, we show that applying DP only to the model does not mitigate this attack, as using privacy metrics breaks the end-to-end DP pipeline.
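To make the critiqued metrics concrete, here is a minimal sketch of the kind of ad-hoc similarity test the abstract describes: a distance-to-closest-record (DCR) check that declares synthetic data "private" if no synthetic row lies too close to a real one. The function names and the threshold are illustrative assumptions, not the paper's exact metric definitions.

```python
import numpy as np

def distance_to_closest_record(synthetic, real):
    """For each synthetic row, Euclidean distance to its nearest real row."""
    # Pairwise differences via broadcasting: shape (n_syn, n_real, n_features)
    diffs = synthetic[:, None, :] - real[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.min(axis=1)

def passes_similarity_privacy_test(synthetic, real, threshold=0.1):
    """Ad-hoc 'privacy' check of the kind the paper critiques: pass if no
    synthetic record is within `threshold` of any real record."""
    return bool(distance_to_closest_record(synthetic, real).min() > threshold)
```

The paper's counter-examples show why such a test is unreliable: a dataset can pass this check while still leaking enough signal for accurate membership or attribute inference, and the test itself consults the real data, which is what breaks any end-to-end DP guarantee.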
Problem

Research questions and friction points this paper is trying to address.

Exposing flaws in similarity-based privacy metrics for synthetic data
Demonstrating privacy attacks on 'anonymous' synthetic datasets
Introducing ReconSyn to exploit vulnerabilities in privacy metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Demonstrates flaws in similarity-based privacy metrics
Introduces ReconSyn reconstruction attack technique
Shows model-level DP alone fails to stop the attack, since the metrics break the end-to-end DP pipeline
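The core idea behind the attack can be sketched as a black-box loop that turns the privacy metric itself into an oracle: sample records from the generator, and keep those the metric flags as "too similar" to the real data, since those flags single out near-copies of training records. This is an illustrative simplification under assumed interfaces (`generate`, `similarity_oracle`), not the paper's actual ReconSyn algorithm.

```python
import numpy as np

def reconstruct_outliers(generate, similarity_oracle, n_rounds=1000, rng=None):
    """Black-box reconstruction loop (simplified): the attacker only needs
    samples from the fitted model and the pass/fail signal of the metric."""
    rng = rng if rng is not None else np.random.default_rng(0)
    recovered = []
    for _ in range(n_rounds):
        candidate = generate(rng)         # one synthetic record from the model
        if similarity_oracle(candidate):  # metric flags it as close to real data
            recovered.append(candidate)   # likely a near-copy of a train record
    return recovered
```

Note that nothing here requires model internals or the real data: the leakage comes entirely from the metric's decisions, which is why adding DP to the model alone does not close the channel.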