🤖 AI Summary
Relational databases lack foundation models because real-world relational data is private, scarce, and structurally heterogeneous. This work proposes the first relational-database foundation model trained entirely on synthetic data. It introduces a structural-causal-model-driven relational prior generator that constructs large-scale, diverse synthetic databases from scratch, combined with DFS-based linearization and a lightweight neural architecture, to pretrain the RDB-PFN model. The approach enables genuine in-context learning and significantly outperforms graph-based and single-table foundation-model baselines across 19 real-world relational prediction tasks, while maintaining efficient inference and a compact model.
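To make the relational prior generator idea concrete, here is a minimal toy sketch: each table is sampled from a random linear structural causal model (columns generated in topological order, each a noisy function of earlier columns), and two tables are linked by a synthetic foreign key so that child rows carry signal from their parent rows. All function names, sizes, and the specific SCM form here are illustrative assumptions, not the paper's actual generator.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scm_table(n_rows=100, n_cols=5):
    """Sample one synthetic table from a random linear SCM.
    Columns are generated in topological order, so each column is a
    noisy nonlinear function of the previously generated columns
    (its 'parents' in a random DAG). Illustrative sketch only."""
    cols = []
    for _ in range(n_cols):
        noise = rng.normal(size=n_rows)
        if cols:
            # random weights over earlier columns act as causal edges
            w = rng.normal(size=len(cols))
            signal = np.column_stack(cols) @ w
            cols.append(np.tanh(signal) + noise)
        else:
            cols.append(noise)
    return np.column_stack(cols)

def sample_relational_prior(n_parent=20, children_per_parent=3):
    """Link a child table to a parent table via a synthetic foreign key,
    so the relational structure itself carries predictive signal."""
    parent = sample_scm_table(n_rows=n_parent)
    fk = rng.integers(0, n_parent, size=n_parent * children_per_parent)
    child = sample_scm_table(n_rows=len(fk))
    # child rows inherit signal from the last column of their parent row
    child[:, 0] += parent[fk, -1]
    return parent, child, fk
```

Sampling many such database instances, each with fresh random structure, yields the kind of endless synthetic pretraining stream the summary describes.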
📄 Abstract
Relational Databases (RDBs) are the backbone of modern business, yet they lack foundation models comparable to those in text or vision. A key obstacle is that high-quality RDBs are private, scarce, and structurally heterogeneous, making internet-scale pre-training infeasible. To overcome this data scarcity, we introduce $\textbf{RDB-PFN}$, the first relational foundation model trained purely on $\textbf{synthetic data}$. Inspired by Prior-Data Fitted Networks (PFNs), where synthetic data generated from Structural Causal Models (SCMs) enables reasoning on single tables, we design a $\textbf{Relational Prior Generator}$ to create an infinite stream of diverse RDBs from scratch. Pre-trained on $\textbf{over 2 million}$ synthetic single-table and relational tasks, RDB-PFN adapts to any new database instantly via genuine $\textbf{in-context learning}$. Experiments show that RDB-PFN achieves strong few-shot performance on 19 real-world relational prediction tasks, outperforming graph-based and single-table foundation-model baselines (given the same DFS-linearized inputs), while using a lightweight architecture and fast inference. The code is available at https://github.com/MuLabPKU/RDBPFN
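The DFS-linearized inputs mentioned above can be pictured with a small sketch: starting from a target row, a depth-first traversal follows foreign-key links and flattens the row's relational neighborhood into a token sequence. The table names, data layout, and traversal details below are illustrative assumptions, not the paper's exact scheme.

```python
def dfs_linearize(tables, links, table, idx, depth=0, max_depth=2):
    """Flatten the neighborhood of one row into a sequence via DFS.
    `tables`: name -> list of row dicts; `links`: (child, fk_col, parent)
    triples describing foreign keys. Returns (depth, table, row) tokens.
    Illustrative sketch of DFS-based linearization."""
    row = tables[table][idx]
    seq = [(depth, table, row)]
    if depth >= max_depth:
        return seq
    # descend into child rows that reference this row through a foreign key
    for child, fk_col, parent in links:
        if parent != table:
            continue
        for i, r in enumerate(tables[child]):
            if r[fk_col] == idx:
                seq += dfs_linearize(tables, links, child, i,
                                     depth + 1, max_depth)
    return seq

# Hypothetical two-table database: users and their orders.
tables = {
    "users":  [{"age": 31}, {"age": 24}],
    "orders": [{"user_id": 0, "amt": 9.5}, {"user_id": 0, "amt": 3.0},
               {"user_id": 1, "amt": 7.2}],
}
links = [("orders", "user_id", "users")]
seq = dfs_linearize(tables, links, "users", 0)
# seq holds the root user row followed by that user's two order rows
```

A sequence like this is what a PFN-style transformer can consume in context, with the depth markers preserving the lost tree structure.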