Synthetic Tabular Data Generation: A Comparative Survey for Modern Techniques

📅 2025-07-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Facing increasingly stringent privacy regulations and restricted access to real-world data, this paper systematically surveys tabular data synthesis techniques for high-stakes domains such as finance and healthcare, focusing on three core challenges: privacy preservation, statistical fidelity, and modeling of complex variable dependencies. Methodologically, the paper proposes a generation-objective-driven taxonomy that categorizes approaches by downstream task adaptability, differential privacy guarantees, and data utility trade-offs. It emphasizes conditional generation and risk-aware modeling to bridge the theory–practice gap, and it unifies generative models (including GANs, VAEs, and diffusion models) with differential privacy mechanisms within a single evaluation framework that integrates statistical metrics and privacy quantification tools. Its contributions include a reproducible benchmark covering state-of-the-art methods and a practical technology selection guide, supporting secure deployment of high-fidelity, privacy-preserving synthetic data in sensitive applications.

📝 Abstract

As privacy regulations become more stringent and access to real-world data becomes increasingly constrained, synthetic data generation has emerged as a vital solution, especially for tabular datasets, which are central to domains like finance, healthcare, and the social sciences. This survey presents a comprehensive and focused review of recent advances in synthetic tabular data generation, emphasizing methods that preserve complex feature relationships, maintain statistical fidelity, and satisfy privacy requirements. A key contribution of this work is the introduction of a novel taxonomy based on practical generation objectives, including intended downstream applications, privacy guarantees, and data utility, directly informing methodological design and evaluation strategies. Therefore, this review prioritizes the actionable goals that drive synthetic data creation, including conditional generation and risk-sensitive modeling. Additionally, the survey proposes a benchmark framework to align technical innovation with real-world demands. By bridging theoretical foundations with practical deployment, this work serves as both a roadmap for future research and a guide for implementing synthetic tabular data in privacy-critical environments.

Problem

Research questions and friction points this paper is trying to address.

Survey modern techniques for synthetic tabular data generation

Preserve feature relationships and statistical fidelity in synthetic data

Address privacy and utility in synthetic data for critical domains

Innovation

Methods, ideas, or system contributions that make the work stand out.

Taxonomy based on practical generation objectives

Benchmark framework aligning technical innovation with real-world demands

Methods preserving feature relationships and privacy
