🤖 AI Summary
To address structural mismatches and low fidelity in synthetic data generation for complex relational databases—caused by constraints such as primary-foreign key overlaps, absence of explicit primary keys, and inter-table temporal dependencies—this paper proposes the first scalable, end-to-end neural framework supporting relational schema integrity preservation, deep multi-hop relational context modeling, and large-scale synthesis. Our method integrates a custom relational graph neural network, an incremental table-generation mechanism, and a constraint-aware sampling strategy to jointly optimize structural consistency and statistical fidelity. Experiments on three real-world, cross-domain open-source databases demonstrate significant improvements: +32.7% in relational validity, +28.4% in multivariate statistical fidelity, and +21.9% in downstream utility (e.g., SQL query accuracy). The framework establishes a foundation for high-trust synthetic data in testing, data sharing, and machine learning applications.
📝 Abstract
Synthetic data has numerous applications, including but not limited to software testing at scale, privacy-preserving data sharing to enable smoother collaboration between stakeholders, and data augmentation for analytical and machine learning tasks. Relational databases, which are commonly used by corporations, governments, and financial institutions, present unique challenges for synthetic data generation due to their complex structures. Existing synthetic relational database generation approaches often assume idealized scenarios, such as every table having a perfect primary key column without composite and potentially overlapping primary or foreign key constraints, and fail to account for the sequential nature of certain tables. In this paper, we propose incremental relational generator (IRG), that successfully handles these ubiquitous real-life situations. IRG ensures the preservation of relational schema integrity, offers a deep contextual understanding of relationships beyond direct ancestors and descendants, leverages the power of newly designed deep neural networks, and scales efficiently to handle larger datasets--a combination never achieved in previous works. Experiments on three open-source real-life relational datasets in different fields at different scales demonstrate IRG's advantage in maintaining the synthetic data's relational schema validity and data fidelity and utility.