Systematic Assessment of Tabular Data Synthesis Algorithms

📅 2024-02-09

📈 Citations: 6

✨ Influential: 1

🤖 AI Summary

Problem: Tabular data synthesizers lack a comprehensive, comparable evaluation. Existing evaluation metrics have drawbacks, and newer synthesizers built on diffusion models and large language models have not been compared head-to-head with state-of-the-art marginal-based synthesizers. Synthesizers also differ in how they protect privacy: some satisfy differential privacy, while others rely on heuristics. Method: The paper presents a systematic evaluation framework for tabular data synthesis. It critiques existing metrics, introduces new metrics for fidelity, privacy, and utility, and devises a unified tuning objective based on those metrics that the authors report consistently improves synthetic data quality for all methods. Results: Evaluations of 8 types of synthesizers on 12 real-world datasets yield findings that the authors say offer new directions for privacy-preserving data synthesis.

📝 Abstract

Data synthesis has been advocated as an important approach for utilizing data while protecting data privacy. A large number of tabular data synthesis algorithms (which we call synthesizers) have been proposed. Some synthesizers satisfy Differential Privacy, while others aim to provide privacy in a heuristic fashion. A comprehensive understanding of the strengths and weaknesses of these synthesizers remains elusive due to drawbacks in evaluation metrics and missing head-to-head comparisons of newly developed synthesizers that take advantage of diffusion models and large language models with state-of-the-art marginal-based synthesizers. In this paper, we present a systematic evaluation framework for assessing tabular data synthesis algorithms. Specifically, we examine and critique existing evaluation metrics, and introduce a set of new metrics in terms of fidelity, privacy, and utility to address their limitations. Based on the proposed metrics, we also devise a unified objective for tuning, which can consistently improve the quality of synthetic data for all methods. We conducted extensive evaluations of 8 different types of synthesizers on 12 real-world datasets and identified some interesting findings, which offer new directions for privacy-preserving data synthesis.
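The abstract mentions a unified objective for tuning built on fidelity, privacy, and utility metrics, but this page does not give its form. As a rough illustration only, the sketch below assumes each metric is expressed as an error where lower is better and combines them with a weighted sum to pick a hyperparameter configuration. The names `Scores`, `unified_objective`, `pick_best`, the weights, and the example numbers are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    """Hypothetical per-configuration scores; lower is better for all three."""
    fidelity_error: float
    privacy_risk: float
    utility_error: float


def unified_objective(scores, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three scores (an assumed form, not the paper's)."""
    w_fidelity, w_privacy, w_utility = weights
    return (
        w_fidelity * scores.fidelity_error
        + w_privacy * scores.privacy_risk
        + w_utility * scores.utility_error
    )


def pick_best(candidates, weights=(1.0, 1.0, 1.0)):
    """Return the name of the configuration with the lowest objective value."""
    return min(
        candidates,
        key=lambda name: unified_objective(candidates[name], weights),
    )


# Made-up scores for three made-up hyperparameter configurations.
candidates = {
    "config_a": Scores(fidelity_error=0.10, privacy_risk=0.30, utility_error=0.20),
    "config_b": Scores(fidelity_error=0.15, privacy_risk=0.05, utility_error=0.25),
    "config_c": Scores(fidelity_error=0.40, privacy_risk=0.02, utility_error=0.50),
}

best = pick_best(candidates)
best_fidelity_only = pick_best(candidates, weights=(1.0, 0.0, 0.0))
```

With equal weights the sketch selects `config_b`, which trades a little fidelity for much lower privacy risk. Weighting fidelity alone selects `config_a`, which shows how the choice of weights encodes the trade-off a practitioner is willing to accept.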

Problem

Research questions and friction points this paper is trying to address.

Evaluating privacy and utility of tabular data synthesizers

Addressing limitations in current synthesis evaluation metrics

Comparing diffusion/LLM-based synthesizers with statistical methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic evaluation framework for tabular synthesis

New fidelity, privacy, and utility metrics

Extensive evaluation of 8 types of synthesizers on 12 real-world datasets
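To make the idea of a fidelity metric concrete, here is a minimal sketch that compares one categorical column of real and synthetic data using total variation distance between their empirical marginals. This is a common generic choice swapped in for illustration; this page does not describe the paper's own fidelity metric, and the function name `marginal_tvd` and the sample data are hypothetical.

```python
from collections import Counter


def marginal_tvd(real_col, synth_col):
    """Total variation distance between two empirical categorical marginals.

    Returns a value in [0, 1]: 0 means identical marginals, 1 means the two
    columns share no category at all.
    """
    if not real_col or not synth_col:
        raise ValueError("both columns must be non-empty")
    real_counts = Counter(real_col)
    synth_counts = Counter(synth_col)
    categories = set(real_counts) | set(synth_counts)
    # Counter returns 0 for categories that are missing from one side.
    return 0.5 * sum(
        abs(real_counts[c] / len(real_col) - synth_counts[c] / len(synth_col))
        for c in categories
    )


# Made-up example column: the synthetic data over-represents category "a".
real = ["a", "a", "b", "b"]
synth = ["a", "a", "a", "b"]

distance = marginal_tvd(real, synth)
```

A one-column marginal check like this cannot detect broken correlations between columns, so it is only a starting point for assessing fidelity.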

🔎 Similar Papers

CTSyn: A Foundational Model for Cross Tabular Data Generation