Synthetic Tabular Data: Methods, Attacks and Defenses

📅 2025-06-06
🤖 AI Summary
This survey addresses the challenge of generating high-quality, privacy-preserving synthetic tabular data. It systematically reviews two dominant paradigms, probabilistic graphical models (e.g., Bayesian networks) and deep learning approaches (e.g., GANs, VAEs, diffusion models), and introduces, for the first time, a closed-loop “generation, evaluation, attack, defense” framework to characterize the privacy-utility trade-off. It proposes a differentially private multi-model generation pipeline, quantifies synthesis quality and privacy leakage with information-theoretic metrics and statistical tests, and designs a structured taxonomy for empirically evaluating membership inference and record reconstruction attacks on healthcare and financial datasets. Key contributions include: (1) a verifiable benchmark for synthetic data utility and privacy; (2) the practicality-vulnerability equilibrium points of mainstream models; and (3) a curated list of open challenges for trustworthy synthetic data research.

📝 Abstract
Synthetic data is often positioned as a solution to replace sensitive fixed-size datasets with a source of unlimited matching data, freed from privacy concerns. There has been much progress in synthetic data generation over the last decade, leveraging corresponding advances in machine learning and data analytics. In this survey, we cover the key developments and the main concepts in tabular synthetic data generation, including paradigms based on probabilistic graphical models and on deep learning. We provide background and motivation, before giving a technical deep-dive into the methodologies. We also address the limitations of synthetic data, by studying attacks that seek to retrieve information about the original sensitive data. Finally, we present extensions and open problems in this area.
Problem

Research questions and friction points this paper is trying to address.

Surveying methods for generating synthetic tabular data
Exploring attacks on synthetic data privacy
Addressing limitations and open problems in synthetic data
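
To make the attack setting concrete, here is a minimal sketch of one simple kind of membership inference: guess that a record was in the training data if some synthetic row lies very close to it. This is an illustrative distance-to-closest-record heuristic, not a method taken from the paper; the function names, the toy values, and the threshold are all assumptions.

```python
import math

def nearest_distance(record, synthetic):
    """Euclidean distance from `record` to its closest synthetic row."""
    return min(math.dist(record, row) for row in synthetic)

def infer_membership(record, synthetic, threshold):
    """Guess 'member' if some synthetic row lies within `threshold` of `record`."""
    return nearest_distance(record, synthetic) <= threshold

# Hypothetical (age, income) rows; the first synthetic row nearly copies a training record.
synthetic = [(34.0, 52000.0), (61.0, 18000.0), (45.0, 75000.0)]
member = (34.0, 52010.0)       # assumed to be in the training data
non_member = (25.0, 30000.0)   # assumed not to be in the training data

print(infer_membership(member, synthetic, threshold=50.0))
print(infer_membership(non_member, synthetic, threshold=50.0))
```

In practice the threshold would need calibration, and columns on different scales would usually be normalized before computing distances.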
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging machine learning for synthetic data
Using probabilistic graphical models
Deep learning for tabular data
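
As a rough illustration of the graphical-model paradigm, the sketch below fits a two-column Bayesian network (A -> B) by counting and then draws synthetic rows by ancestral sampling. It is a toy under stated assumptions: the column values, function names, and network structure are hypothetical, and no noise is added, so it offers no differential privacy guarantee.

```python
import random
from collections import Counter, defaultdict

def fit_chain(rows):
    """Estimate counts for P(A) and P(B | A) from (a, b) rows."""
    marginal = Counter(a for a, _ in rows)
    conditional = defaultdict(Counter)
    for a, b in rows:
        conditional[a][b] += 1
    return marginal, conditional

def sample_chain(marginal, conditional, n, rng):
    """Ancestral sampling: draw A from P(A), then B from P(B | A)."""
    a_values, a_weights = zip(*marginal.items())
    out = []
    for _ in range(n):
        a = rng.choices(a_values, weights=a_weights)[0]
        b_values, b_weights = zip(*conditional[a].items())
        b = rng.choices(b_values, weights=b_weights)[0]
        out.append((a, b))
    return out

# Hypothetical sensitive table with two categorical columns.
rows = [("smoker", "high"), ("smoker", "high"), ("smoker", "low"),
        ("non-smoker", "low"), ("non-smoker", "low"), ("non-smoker", "high")]
marginal, conditional = fit_chain(rows)
synthetic = sample_chain(marginal, conditional, 5, random.Random(0))
print(synthetic)
```

A private variant would typically perturb the counts (for example with calibrated noise) before sampling; that step is deliberately left out here.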