Synthesizing Tabular Data Using Selectivity Enhanced Generative Adversarial Networks

📅 2025-02-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Synthetic tabular data for e-commerce stress testing (e.g., Black Friday) must be faithful to the real data, preserve privacy, and impose a realistic query-processing load on the database. This paper proposes a GAN-based method that embeds query selectivity constraints in the generative framework. A pre-trained deep neural network models the selectivity of the real data and is integrated with the generator so that synthetic and real data behave consistently under database queries. On five real-world datasets, the method outperforms three state-of-the-art GANs and a VAE model, improving selectivity estimation accuracy by up to 20% and machine learning utility by up to 6%.

📝 Abstract

As e-commerce platforms face surging transactions during major shopping events like Black Friday, stress testing with synthesized data is crucial for resource planning. Most recent studies use Generative Adversarial Networks (GANs) to generate tabular data while ensuring privacy and machine learning utility. However, these methods overlook the computational demands of processing GAN-generated data, making them unsuitable for e-commerce stress testing. This thesis introduces a novel GAN-based approach incorporating query selectivity constraints, a key factor in database transaction processing. We integrate a pre-trained deep neural network to maintain selectivity consistency between real and synthetic data. Our method, tested on five real-world datasets, outperforms three state-of-the-art GANs and a VAE model, improving selectivity estimation accuracy by up to 20% and machine learning utility by up to 6%.
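To make "query selectivity" concrete: the selectivity of a predicate is the fraction of rows in a table that satisfy it, and selectivity consistency means this fraction is similar on real and synthetic data. The sketch below is only an illustration of that definition; the toy tables, the `price` column, and the helper names `selectivity` and `selectivity_gap` are assumptions and do not come from the paper.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]


def selectivity(rows: List[Row], predicate: Callable[[Row], bool]) -> float:
    """Fraction of rows that satisfy the predicate (0.0 for an empty table)."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if predicate(row)) / len(rows)


def selectivity_gap(real: List[Row], synthetic: List[Row],
                    predicate: Callable[[Row], bool]) -> float:
    """Absolute difference in selectivity between real and synthetic tables."""
    return abs(selectivity(real, predicate) - selectivity(synthetic, predicate))


# Hypothetical toy tables; a real evaluation would use many queries and rows.
real = [{"price": 10.0}, {"price": 25.0}, {"price": 40.0}, {"price": 55.0}]
synthetic = [{"price": 12.0}, {"price": 18.0}, {"price": 22.0}, {"price": 60.0}]


def cheap(row: Row) -> bool:
    """Stands in for a SQL filter such as: WHERE price < 30."""
    return row["price"] < 30.0


gap = selectivity_gap(real, synthetic, cheap)
```

A small gap across a representative query workload suggests that the synthetic table would exercise the database in a way similar to the real one.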

Problem

Research questions and friction points this paper is trying to address.

Accounts for the computational cost of processing GAN-generated data, which prior methods overlook, in e-commerce stress testing.

Improves selectivity estimation accuracy in synthetic tabular data generation.

Enhances machine learning utility while maintaining privacy in data synthesis.

Innovation

Methods, ideas, or system contributions that make the work stand out.

GANs with query selectivity constraints

Pre-trained deep neural network integration

Improved selectivity and ML utility
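One way to picture a selectivity constraint inside GAN training is as an extra penalty added to the generator's adversarial loss. The sketch below shows only the arithmetic of such a combined objective. The mean-squared-error form, the `weight` parameter, and the function names are assumptions made for illustration; the source does not state the actual loss, and in practice the selectivity values would come from the pre-trained network rather than from fixed lists.

```python
from typing import Sequence


def selectivity_penalty(real_sel: Sequence[float],
                        synth_sel: Sequence[float]) -> float:
    """Mean squared gap between per-query selectivities (assumed form)."""
    if len(real_sel) != len(synth_sel):
        raise ValueError("need one selectivity per query for both tables")
    if not real_sel:
        return 0.0
    return sum((r - s) ** 2 for r, s in zip(real_sel, synth_sel)) / len(real_sel)


def generator_loss(adversarial_loss: float,
                   real_sel: Sequence[float],
                   synth_sel: Sequence[float],
                   weight: float = 1.0) -> float:
    """Adversarial loss plus a weighted selectivity penalty (hypothetical)."""
    return adversarial_loss + weight * selectivity_penalty(real_sel, synth_sel)


# Hypothetical selectivities for two queries on real and synthetic data.
real_sel = [0.5, 0.2]
synth_sel = [0.7, 0.2]
loss = generator_loss(0.7, real_sel, synth_sel, weight=10.0)
```

Raising `weight` pushes the generator toward matching query selectivities at the possible expense of the adversarial term, so in this sketch it acts as the trade-off knob between the two goals.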

🔎 Similar Papers

CTSyn: A Foundational Model for Cross Tabular Data Generation