🤖 AI Summary
This work addresses the challenge of building efficient, general-purpose predictive models for large-scale relational data without manual feature engineering or table flattening, particularly under cold-start and high-noise conditions. The authors present KumoRFM-2, the next iteration of a pre-trained foundation model for relational data, which natively operates on one or more connected tables while preserving temporal consistency. It is pre-trained across four axes (rows, columns, foreign keys, and cross-sample contexts) and supports both in-context learning and fine-tuning. Unlike its predecessor, it injects task information as early as possible, which enables sharper selection of task-relevant columns and improved robustness to noisy data. Evaluated on 41 benchmarks, the model outperforms supervised and foundation-model approaches by up to 8%, and it scales to billion-scale relational datasets.
📝 Abstract
We introduce KumoRFM-2, the next iteration of a pre-trained foundation model for relational data. KumoRFM-2 supports in-context learning as well as fine-tuning and is applicable to a wide range of predictive tasks. In contrast to tabular foundation models, KumoRFM-2 natively operates on relational data, processing one or more connected tables simultaneously without manual table flattening or target variable generation, all while preserving temporal consistency. KumoRFM-2 leverages a large corpus of synthetic and real-world data to pre-train across four axes: the row and column dimensions at the individual table level, and the foreign key and cross-sample dimensions at the database level. In contrast to its predecessor, KumoRFM-2 injects task information as early as possible, enabling sharper selection of task-relevant columns and improved robustness to noisy data. Through extensive experiments on 41 challenging benchmarks and analysis around expressivity and sensitivity, we demonstrate that KumoRFM-2 outperforms supervised and foundational approaches by up to 8%, while maintaining strong performance under extreme settings of cold start and noisy data. To our knowledge, this is the first time a few-shot foundation model has been shown to surpass supervised approaches on common benchmark tasks, with performance further improving upon fine-tuning. Finally, while KumoRFM-1 was limited to small-scale in-memory datasets, KumoRFM-2 scales to billion-scale relational datasets.
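To make "preserving temporal consistency" concrete, here is a minimal sketch of gathering context for one entity across a foreign key without flattening the tables, keeping only rows recorded before the prediction time. It is not KumoRFM-2's implementation or API; the toy tables, the `context_for` helper, and the `anchor_time` parameter are all hypothetical names chosen for illustration.

```python
from datetime import datetime

# Hypothetical toy database: two tables linked by the
# foreign key orders.user_id -> users.id.
users = [{"id": 1, "country": "DE"}, {"id": 2, "country": "US"}]
orders = [
    {"id": 10, "user_id": 1, "amount": 20.0, "ts": datetime(2024, 1, 5)},
    {"id": 11, "user_id": 1, "amount": 35.0, "ts": datetime(2024, 2, 1)},
    {"id": 12, "user_id": 2, "amount": 15.0, "ts": datetime(2024, 1, 20)},
    {"id": 13, "user_id": 1, "amount": 50.0, "ts": datetime(2024, 3, 10)},
]


def context_for(user_id, anchor_time):
    """Return the entity row plus its linked rows dated strictly before anchor_time.

    Rows at or after anchor_time are excluded so that no future
    information leaks into the context used for prediction.
    """
    user = next(u for u in users if u["id"] == user_id)
    history = [
        o for o in orders
        if o["user_id"] == user_id and o["ts"] < anchor_time
    ]
    return {"entity": user, "history": sorted(history, key=lambda o: o["ts"])}


# Context for user 1 as of mid-February: the March order is filtered out.
ctx = context_for(1, datetime(2024, 2, 15))
history_ids = [o["id"] for o in ctx["history"]]
```

The linked rows stay in their own table and are reached through the foreign key rather than being joined into one wide row, which is the general idea behind avoiding manual flattening. A real system would apply the same time filter at every hop of the relational graph.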