🤖 AI Summary
Synthetic tabular data for e-commerce stress testing (e.g., Black Friday) must combine high fidelity, privacy preservation, and realistic behavior when queried in a database. To meet these requirements, this thesis proposes a GAN-based method that explicitly embeds query selectivity constraints into the generative framework. It pretrains a deep neural network to model the selectivity distribution of real data and jointly optimizes it end-to-end with the generator, so that synthetic and real data behave consistently under SQL queries. Experiments on five real-world datasets show that the method improves selectivity estimation accuracy by up to 20% and machine learning utility by up to 6% over three state-of-the-art GANs and a VAE model.
📝 Abstract
As e-commerce platforms face surging transactions during major shopping events like Black Friday, stress testing with synthesized data is crucial for resource planning. Most recent studies use Generative Adversarial Networks (GANs) to generate tabular data while ensuring privacy and machine learning utility. However, these methods overlook the computational demands of processing GAN-generated data, making them unsuitable for e-commerce stress testing. This thesis introduces a novel GAN-based approach incorporating query selectivity constraints, a key factor in database transaction processing. We integrate a pre-trained deep neural network to maintain selectivity consistency between real and synthetic data. Our method, tested on five real-world datasets, outperforms three state-of-the-art GANs and a VAE model, improving selectivity estimation accuracy by up to 20% and machine learning utility by up to 6%.
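To make the notion of selectivity consistency concrete, the following is a minimal sketch, not the method from the thesis. It assumes that selectivity means the fraction of rows satisfying a range predicate, and it compares that fraction between a real and a synthetic table. The function names `selectivity` and `selectivity_gap`, the toy columns, and the predicates are all hypothetical and chosen only for illustration.

```python
import numpy as np


def selectivity(table, column, low, high):
    """Fraction of rows whose `column` value lies in the closed range [low, high]."""
    values = table[column]
    mask = (values >= low) & (values <= high)
    return float(mask.mean())


def selectivity_gap(real, synthetic, predicates):
    """Mean absolute selectivity difference over (column, low, high) predicates."""
    gaps = [
        abs(selectivity(real, col, lo, hi) - selectivity(synthetic, col, lo, hi))
        for col, lo, hi in predicates
    ]
    return float(np.mean(gaps))


# Toy tables stored as column-name -> NumPy array (illustrative values only).
real = {
    "price": np.array([5.0, 12.0, 20.0, 35.0, 50.0]),
    "quantity": np.array([1, 2, 2, 3, 5]),
}
synthetic = {
    "price": np.array([6.0, 11.0, 22.0, 30.0, 70.0]),
    "quantity": np.array([1, 1, 2, 4, 5]),
}

# Each predicate mimics a SQL filter such as: WHERE price BETWEEN 10 AND 40.
predicates = [("price", 10.0, 40.0), ("quantity", 2, 3)]

gap = selectivity_gap(real, synthetic, predicates)
print(f"mean absolute selectivity gap: {gap:.3f}")
```

A gap near zero means the two tables return similar fractions of rows for the same filters, which is the kind of consistency the abstract describes. How the thesis turns such a signal into a training constraint for the generator is not specified here.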