Towards Foundation Models for Relational Databases with Language Models and Graph Neural Networks

📅 2026-05-15

🤖 AI Summary
This work addresses two limitations in learning on relational databases: conventional deep learning pipelines flatten the database into a single table through manual feature engineering and lose relational structure, while existing relational deep learning (RDL) models remain specific to one task and one database. The authors propose a lightweight hybrid architecture that combines a fine-tuned BART encoder with a GraphSAGE graph neural network: BART captures intra-row semantics, and GraphSAGE propagates the resulting row embeddings over a relational entity graph to inject structural context. On the driver-dnf task from the rel-f1 dataset in RelBench, the method reaches a ROC-AUC of 67.40, which is close to LightGBM (68.86) and 5.22 points below RDL (72.62), but well short of the KumoRFM foundation model (82.63). The authors read this as evidence that lightweight LM-GNN hybrids are a promising, resource-efficient direction for foundation models on relational data.
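To make the "propagate row embeddings over the graph" step concrete, here is a minimal pure-Python sketch of one GraphSAGE-style layer with mean aggregation. It is an illustration under stated assumptions, not the authors' implementation: the row vectors stand in for BART embeddings, the node names and foreign-key neighbor lists are hypothetical, and the update uses the common sum form `ReLU(W_self h_v + W_neigh mean(h_u))` with identity weights so the result is easy to check by hand.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def mean_vectors(vectors: List[Vector], dim: int) -> Vector:
    """Element-wise mean of the vectors; zeros if a node has no neighbors."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sage_layer(
    embeddings: Dict[str, Vector],
    neighbors: Dict[str, List[str]],
    w_self: Matrix,
    w_neigh: Matrix,
) -> Dict[str, Vector]:
    """One mean-aggregation layer: ReLU(W_self h_v + W_neigh mean(h_u))."""
    dim = len(next(iter(embeddings.values())))
    updated = {}
    for node, h in embeddings.items():
        agg = mean_vectors([embeddings[n] for n in neighbors.get(node, [])], dim)
        combined = [a + b for a, b in zip(matvec(w_self, h), matvec(w_neigh, agg))]
        updated[node] = [max(0.0, x) for x in combined]
    return updated


# Hypothetical toy graph: one driver row linked to two result rows.
# The 2-d vectors are placeholders for language-model row embeddings.
row_embeddings = {
    "driver:1": [1.0, 0.0],
    "result:10": [0.0, 2.0],
    "result:11": [0.0, 4.0],
}
fk_neighbors = {
    "driver:1": ["result:10", "result:11"],
    "result:10": ["driver:1"],
    "result:11": ["driver:1"],
}
identity = [[1.0, 0.0], [0.0, 1.0]]  # assumed weights, chosen for readability

enriched = sage_layer(row_embeddings, fk_neighbors, identity, identity)
```

After one layer, the driver row's vector also carries the average of its linked result rows, which is the sense in which the graph component adds relational context to a per-row embedding. A trained model would learn the weight matrices and typically stack several such layers.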
📝 Abstract
Relational databases store much of the world's structured information, and they are essential for driving complex predictive applications. However, deep learning progress on relational data remains limited, as conventional approaches flatten databases into single tables via manual feature engineering, discarding relational context. Relational deep learning (RDL) addresses this by modeling databases as relational entity graphs (REGs) for graph neural networks (GNNs), but remains task- and database-specific. To combine the strengths of both paradigms, we propose a hybrid architecture combining a fine-tuned BART encoder to capture intra-row semantics with a GraphSAGE-based GNN over REGs to inject relational context. Experiments on RelBench show that the GNN substantially enriches BART's row embeddings, achieving a ROC-AUC of 67.40 on the driver-dnf task from the rel-f1 dataset. This performance is competitive with supervised baselines such as LightGBM (68.86) and narrows the gap to RDL (72.62) to within 5.22 points, though a substantial gap remains to state-of-the-art foundation models such as KumoRFM (82.63). These results suggest that lightweight hybrid LM-GNN architectures offer a promising and resource-efficient path towards foundation models for relational databases.
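The gaps quoted in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the ROC-AUC figures reported above for the driver-dnf task; the dictionary keys are labels chosen here for illustration.

```python
# ROC-AUC figures as reported in the abstract (driver-dnf, rel-f1).
reported_auc = {
    "hybrid BART + GraphSAGE": 67.40,
    "LightGBM": 68.86,
    "RDL": 72.62,
    "KumoRFM": 82.63,
}

hybrid = reported_auc["hybrid BART + GraphSAGE"]

# Gap in ROC-AUC points between each baseline and the hybrid model.
gaps = {
    name: round(score - hybrid, 2)
    for name, score in reported_auc.items()
    if name != "hybrid BART + GraphSAGE"
}

for name, gap in gaps.items():
    print(f"{name}: {gap:.2f} points above the hybrid model")
```

The RDL gap comes out at 5.22 points, matching the abstract, while the gap to KumoRFM is roughly three times larger.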
Problem

Research questions and friction points this paper is trying to address.

relational databases
foundation models
relational deep learning
graph neural networks
language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

hybrid LM-GNN architecture
relational entity graphs
foundation models for relational databases
GraphSAGE
BART encoder