AI Summary
This work addresses the scarcity of publicly available multi-table relational databases for training Relational Foundation Models (RFMs), a shortage driven largely by privacy constraints. To overcome this limitation, the authors propose PluRel, a framework that models database schemas as directed graphs, captures primary-foreign key relationships via bipartite graphs, and generates table features through conditional causal mechanisms, yielding structurally coherent and diverse synthetic multi-table data. The approach is lightweight and efficient, and with it the authors observe for the first time that RFM pretraining loss follows a power law in both the number of synthetic databases and the total token count. Experimental results show that models pretrained on this synthetic data generalize significantly better to real-world databases, establishing PluRel as a scalable and effective training paradigm for RFMs.
Abstract
Relational Foundation Models (RFMs) facilitate data-driven decision-making by learning from complex multi-table databases. However, the diverse relational databases needed to train such models are rarely public due to privacy constraints. While there are methods to generate synthetic tabular data of arbitrary size, incorporating schema structure and primary-foreign key connectivity for multi-table generation remains challenging. Here we introduce PluRel, a framework to synthesize multi-table relational databases from scratch. In a step-by-step fashion, PluRel models (1) schemas with directed graphs, (2) inter-table primary-foreign key connectivity with bipartite graphs, and (3) feature distributions in tables via conditional causal mechanisms. The design space across these stages supports the synthesis of a wide range of diverse databases while remaining computationally lightweight. Using PluRel, we observe for the first time that (1) RFM pretraining loss exhibits power-law scaling with the number of synthetic databases and total pretraining tokens, (2) scaling the number of synthetic databases improves generalization to real databases, and (3) synthetic pretraining yields strong base models for continued pretraining on real databases. Overall, our framework and results position synthetic data scaling as a promising paradigm for RFMs.
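To make the three-stage pipeline concrete, the sketch below illustrates one *possible* minimal instantiation of the ideas described in the abstract: a schema sampled as a directed acyclic graph over tables, primary-foreign key connectivity sampled as a bipartite assignment of child rows to parent rows, and features drawn from a simple conditional mechanism. All names and distributional choices here are our own illustrative assumptions, not PluRel's actual implementation.

```python
import random

def synthesize_database(n_tables=4, rows_per_table=6, n_features=3, seed=0):
    """Illustrative three-stage synthesis (hypothetical, not PluRel's code).

    Stage 1: schema as a DAG over tables (edge u -> v means table v holds a
             foreign key referencing table u's primary key).
    Stage 2: PK-FK connectivity as a bipartite assignment of child rows to
             parent rows.
    Stage 3: features drawn from a conditional mechanism that depends on the
             linked parent rows' features.
    """
    rng = random.Random(seed)

    # Stage 1: sampling only edges from lower- to higher-indexed tables
    # guarantees acyclicity, so index order is a valid topological order.
    schema_edges = [(u, v)
                    for u in range(n_tables)
                    for v in range(u + 1, n_tables)
                    if rng.random() < 0.5]

    # Stage 2: for each FK edge, link every child row to a random parent row
    # (a bipartite graph between child rows and parent primary keys).
    fk_links = {(u, v): [rng.randrange(rows_per_table)
                         for _ in range(rows_per_table)]
                for (u, v) in schema_edges}

    # Stage 3: generate features table by table in topological order so each
    # child row can condition on already-generated parent features.
    tables = {}
    for t in range(n_tables):
        parents = [u for (u, v) in schema_edges if v == t]
        rows = []
        for r in range(rows_per_table):
            # Toy conditional mechanism: noisy sum of linked parent-row means.
            parent_signal = 0.0
            for u in parents:
                parent_row = tables[u][fk_links[(u, t)][r]]
                parent_signal += sum(parent_row) / len(parent_row)
            rows.append([parent_signal + rng.gauss(0, 1)
                         for _ in range(n_features)])
        tables[t] = rows
    return schema_edges, fk_links, tables

schema, links, tables = synthesize_database()
```

Varying the edge probability, the child-to-parent assignment distribution, and the conditional mechanism spans a design space of databases, which is the kind of diversity knob the abstract alludes to.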