🤖 AI Summary
This work addresses the challenge of efficiently generating privacy-preserving synthetic data that faithfully captures high-dimensional real-world distributions. The authors propose a lightweight approach based on fully connected neural networks combined with a randomized loss function, eliminating the need for complex architectures. By mapping Gaussian noise onto the data manifold, the method achieves high fidelity while substantially accelerating generation. Extensive experiments across 25 real-world tabular datasets demonstrate that the proposed technique attains state-of-the-art performance in terms of Maximum Mean Discrepancy (MMD), offers several orders of magnitude speedup in synthesis time, and effectively supports downstream classification tasks—successfully balancing privacy preservation, data fidelity, and computational efficiency.
📝 Abstract
The use of synthetic data in machine learning applications and research offers many benefits, including performance improvements through data augmentation, privacy preservation of the original samples, and reliable method assessment on fully synthetic data. This work proposes a time-efficient synthetic data generation method based on a fully connected neural network and a randomized loss function that transforms a random Gaussian distribution to approximate a target real-world dataset. Experiments conducted on 25 diverse tabular real-world datasets confirm that the proposed solution surpasses state-of-the-art generative methods and reaches reference maximum mean discrepancy (MMD) scores orders of magnitude faster than modern deep learning solutions. The experiments covered distributional similarity analysis, the impact on classification quality, and PCA-based dimensionality reduction, which further enhances data privacy and can boost classification quality while reducing time and memory complexity.
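The headline metric above, maximum mean discrepancy, measures how far a synthetic sample lies from the real data distribution in a kernel feature space. The sketch below is an illustrative NumPy implementation of the standard (biased) squared-MMD estimator with a Gaussian RBF kernel; it is not code from the paper, and the bandwidth `gamma`, sample sizes, and toy distributions are arbitrary choices for demonstration.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian RBF kernel matrix: k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-gamma * sq_dists)

def mmd2(X, Y, gamma=1.0):
    """Biased estimator of squared MMD between two samples X and Y."""
    return (
        rbf_kernel(X, X, gamma).mean()
        + rbf_kernel(Y, Y, gamma).mean()
        - 2.0 * rbf_kernel(X, Y, gamma).mean()
    )

# Toy check: "real" data vs. two synthetic candidates.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 2))
good_synth = rng.normal(0.0, 1.0, size=(200, 2))  # same distribution -> small MMD
bad_synth = rng.normal(3.0, 1.0, size=(200, 2))   # shifted distribution -> large MMD
print(mmd2(real, good_synth))
print(mmd2(real, bad_synth))
```

A lower MMD between real and synthetic samples indicates higher distributional fidelity, which is the sense in which the paper reports "reference MMD scores".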