The DCR Delusion: Measuring the Privacy Risk of Synthetic Data

πŸ“… 2025-05-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Distance-to-Closest-Record (DCR) and related distance-based proxy metrics systematically fail as privacy assessments for synthetic data: they show no statistical correlation with actual membership inference attack (MIA) risk and therefore do not reflect real privacy leakage. Method: We conduct the first comprehensive empirical evaluation across diverse generative models (Baynet, CTGAN, diffusion models), multiple benchmark datasets, and extensive hyperparameter configurations to assess DCR's robustness and reliability. Contribution/Results: Our experiments reveal that synthetic datasets passing DCR thresholds exhibit MIA leakage rates exceeding 90%, demonstrating the metric's severe unreliability. We identify the root cause: DCR fundamentally conflates data similarity with privacy, a conceptual flaw that renders both its binary and continuous variants incapable of capturing true adversarial risk. This work advocates a paradigm shift toward empirically grounded, end-to-end privacy evaluation based on realistic attacks, with critical implications for regulatory compliance and legal claims of anonymization.

πŸ“ Abstract
Synthetic data has become an increasingly popular way to share data without revealing sensitive information. Though Membership Inference Attacks (MIAs) are widely considered the gold standard for empirically assessing the privacy of a synthetic dataset, practitioners and researchers often rely on simpler proxy metrics such as Distance to Closest Record (DCR). These metrics estimate privacy by measuring the similarity between the training data and the generated synthetic data. This similarity is also compared against the similarity between the training data and a disjoint holdout set of real records to construct a binary privacy test: if the synthetic data is not more similar to the training data than the holdout set is, it passes the test and is considered private. In this work we show that, while computationally inexpensive, DCR and other distance-based metrics fail to identify privacy leakage. Across multiple datasets, and for both classical models such as Baynet and CTGAN and more recent diffusion models, we show that datasets deemed private by proxy metrics are highly vulnerable to MIAs. We similarly find both the binary privacy test and the continuous measure based on these metrics to be uninformative of actual membership inference risk. We further show that these failures are consistent across different metric hyperparameter settings and record selection methods. Finally, we argue that DCR and other distance-based metrics are flawed by design, and show an example of a simple leakage they miss in practice. With this work, we hope to motivate practitioners to move away from proxy metrics and toward MIAs as the rigorous, comprehensive standard for evaluating the privacy of synthetic data, in particular when making claims that datasets are legally anonymous.
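
The binary test described in the abstract can be made concrete with a short sketch. This is an illustrative reading, not the paper's exact protocol: it assumes records are preprocessed numeric vectors, uses Euclidean distance, and aggregates per-record DCRs with a median (the function names and the aggregation statistic are our assumptions).

```python
# Minimal sketch of DCR and the binary privacy test described above.
# Assumes preprocessed numeric records, Euclidean distance, and a median
# aggregate; real implementations vary in all three choices.
import numpy as np
from scipy.spatial.distance import cdist

def dcr(queries: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Distance from each query record to its closest record in `reference`."""
    return cdist(queries, reference).min(axis=1)

def passes_dcr_test(train, holdout, synthetic) -> bool:
    """Pass if the synthetic data is no closer to the training data than
    the disjoint holdout set of real records is."""
    return np.median(dcr(synthetic, train)) >= np.median(dcr(holdout, train))
```

The paper's central finding is that passing such a test says little about adversarial risk: datasets for which `passes_dcr_test` returns True can still be highly vulnerable to MIAs.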
Problem

Research questions and friction points this paper is trying to address.

Whether Distance to Closest Record (DCR) reliably measures the privacy risk of synthetic data
Demonstrating that DCR fails to detect privacy leakage in synthetic datasets
Advocating Membership Inference Attacks (MIAs) as a more reliable privacy metric
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distance-based metrics fail to detect privacy leaks
Proxy metrics are unreliable for privacy assessment
Advocates using Membership Inference Attacks for privacy evaluation (see the MIA sketch below)
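
For contrast with the DCR test, here is a hedged sketch of an end-to-end MIA evaluation: score each target record by its likelihood under a density model fit to the synthetic data, then measure how well that score separates training members from non-members. The KDE-based attack score is a simple illustrative baseline of our choosing, not one of the paper's attacks.

```python
# Hedged sketch of an end-to-end MIA evaluation for synthetic data.
# The KDE attack score is an illustrative baseline, not the paper's
# method; AUC of 0.5 means no leakage, 1.0 means full leakage.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KernelDensity

def mia_auc(synthetic, members, non_members, bandwidth=0.5):
    """How well does likelihood under the synthetic-data distribution
    separate training members from non-members?"""
    kde = KernelDensity(bandwidth=bandwidth).fit(synthetic)
    targets = np.vstack([members, non_members])
    labels = np.r_[np.ones(len(members)), np.zeros(len(non_members))]
    scores = kde.score_samples(targets)  # log-density; higher = more leaked
    return roc_auc_score(labels, scores)
```

Unlike a distance threshold, this evaluation directly simulates an adversary and reports leakage on the same scale the paper argues for: attack success against real member/non-member records.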
πŸ”Ž Similar Papers
No similar papers found.
Zexi Yao
Imperial College London
Nataša Krčo
Imperial College London
Georgi Ganev
Researcher at SAS/UCL
Machine Learning · Synthetic Data · Differential Privacy · Data Privacy
Yves-Alexandre de Montjoye
Imperial College London