🤖 AI Summary
Relational data is hard to transfer across datasets and tasks in a zero-shot setting because schemas are heterogeneous and databases differ in graph structure and functional dependencies. To address this, we propose the Relational Transformer (RT), an architecture built around *Relational Attention*, a novel mechanism that attends over columns, rows, and primary-foreign key links. RT tokenizes each cell together with its table- and column-level metadata and is pretrained via masked token prediction on diverse relational databases. In zero-shot evaluation on binary classification tasks, the 22M-parameter RT averages 94% of fully supervised AUROC, compared with 84% for a 27B-parameter LLM. After fine-tuning, RT attains state-of-the-art performance with high sample efficiency. Our core contributions are: (1) the first general-purpose architecture enabling zero-shot transfer for relational data, and (2) Relational Attention, a principled mechanism that unifies structural and semantic relational cues.
📝 Abstract
Pretrained transformers readily adapt to new sequence modeling tasks via zero-shot prompting, but relational domains still lack architectures that transfer across datasets and tasks. The core challenge is the diversity of relational data, with heterogeneous schemas, graph structures, and functional dependencies. In this paper, we present the Relational Transformer (RT) architecture, which can be pretrained on diverse relational databases and applied directly to unseen datasets and tasks without task- or dataset-specific fine-tuning or retrieval of in-context examples. RT (i) tokenizes cells with table/column metadata, (ii) is pretrained via masked token prediction, and (iii) utilizes a novel *Relational Attention* mechanism over columns, rows, and primary-foreign key links. Pretrained on RelBench datasets spanning tasks such as churn and sales forecasting, RT attains strong zero-shot performance, averaging 94% of fully supervised AUROC on binary classification tasks with a single forward pass of a 22M-parameter model, as opposed to 84% for a 27B-parameter LLM. Fine-tuning yields state-of-the-art results with high sample efficiency. Our experiments show that RT's zero-shot transfer harnesses task-table context, relational attention patterns, and schema semantics. Overall, RT provides a practical path toward foundation models for relational data.
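To make the idea of cell-level tokens and attention over columns, rows, and primary-foreign key links more concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The `Cell` type, the `tokenize` and `relational_masks` helpers, the toy `users`/`orders` schema, and the choice to express each relation as a boolean mask are all hypothetical and serve only as illustration; the abstract does not specify how Relational Attention is actually computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Hypothetical cell token: a value plus its table/column metadata."""
    table: str
    column: str
    row: int       # primary key of the row inside its table
    value: object


def tokenize(tables):
    """Flatten {table: {row_id: {column: value}}} into a list of cell tokens."""
    return [
        Cell(table, column, row_id, value)
        for table, rows in tables.items()
        for row_id, columns in rows.items()
        for column, value in columns.items()
    ]


def relational_masks(cells, fk_links):
    """Build three boolean masks over cell tokens (assumed design).

    col[i][j]  -> cells i and j share a table and a column
    row[i][j]  -> cells i and j belong to the same row
    link[i][j] -> the rows of cells i and j are joined by a PK-FK link
    """
    # Treat links as undirected so parent and child rows can see each other.
    linked = set(fk_links) | {(b, a) for a, b in fk_links}
    n = len(cells)
    col = [[False] * n for _ in range(n)]
    row = [[False] * n for _ in range(n)]
    link = [[False] * n for _ in range(n)]
    for i, a in enumerate(cells):
        for j, b in enumerate(cells):
            same_table = a.table == b.table
            col[i][j] = same_table and a.column == b.column
            row[i][j] = same_table and a.row == b.row
            link[i][j] = ((a.table, a.row), (b.table, b.row)) in linked
    return col, row, link


# Toy database: two users, each with one order (illustrative data only).
tables = {
    "users": {1: {"age": 34}, 2: {"age": 51}},
    "orders": {
        10: {"amount": 20.0, "user_id": 1},
        11: {"amount": 5.0, "user_id": 2},
    },
}
# Each link is ((child_table, child_row), (parent_table, parent_row)).
fk_links = {
    (("orders", 10), ("users", 1)),
    (("orders", 11), ("users", 2)),
}

cells = tokenize(tables)
col, row, link = relational_masks(cells, fk_links)
```

In a transformer layer, masks like these could restrict which cell tokens each attention head is allowed to attend to, and masked token prediction would then hide a cell's value and ask the model to recover it from the cells that remain visible. How RT combines or weights the three relations is not stated in the abstract.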