Transformers Boost the Performance of Decision Trees on Tabular Data across Sample Sizes

📅 2025-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the uneven performance of tabular models across small- and large-sample regimes. It proposes two lightweight ensembling frameworks, LLM-Boost and PFN-Boost, which fuse the pretrained priors of large language models (LLMs) and of TabPFN with the scalability of gradient-boosted decision trees (GBDTs). The fusion lets GBDTs benefit from the transformers' pretraining and from their natural-language understanding of column headers, without architectural changes or significant added training cost. Experiments show that PFN-Boost achieves the best average performance across all dataset sizes except the very smallest, while LLM-Boost outperforms standalone LLM, TabPFN, and GBDT baselines across a wide range of intermediate dataset sizes. The core contribution is bridging pretrained foundation models with tree-based learners for scalable, sample-robust tabular modeling.
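The fusion idea in the summary above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the transformer's output scores are used as the starting point for boosting, uses hand-rolled regression stumps in place of a real GBDT library, and `transformer_logits` is a hypothetical stand-in for a pretrained LLM/TabPFN scorer.

```python
import numpy as np

rng = np.random.default_rng(0)

def transformer_logits(X):
    # Hypothetical stand-in for the pretrained transformer's (LLM/TabPFN)
    # output scores; in the paper these come from an actual pretrained model.
    return 0.5 * X[:, 0]

def fit_stump(X, residuals):
    # Depth-1 regression tree fit to the residuals (exhaustive split search).
    best = None
    for j in range(X.shape[1]):
        for t in X[:, j]:
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue
            lm, rm = residuals[left].mean(), residuals[~left].mean()
            err = ((residuals[left] - lm) ** 2).sum() \
                + ((residuals[~left] - rm) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    _, j, t, lm, rm = best
    return lambda Z, j=j, t=t, lm=lm, rm=rm: np.where(Z[:, j] <= t, lm, rm)

def pfn_boost_fit(X, y, rounds=25, lr=0.3):
    # Key idea: initialize boosting from the transformer's scores instead of
    # a constant, so the trees only need to learn the residual signal.
    pred = transformer_logits(X).astype(float)
    for _ in range(rounds):
        stump = fit_stump(X, y - pred)  # squared-error gradient = residual
        pred = pred + lr * stump(X)
    return pred

# Tiny synthetic regression demo: the transformer prior captures part of the
# target, and boosting on top of it reduces the remaining error.
X = rng.uniform(-1, 1, size=(60, 2))
y = X[:, 0] + np.sin(3 * X[:, 1])

mse_prior = np.mean((y - transformer_logits(X)) ** 2)
mse_boost = np.mean((y - pfn_boost_fit(X, y)) ** 2)
```

In practice one would pass the transformer's scores to a real GBDT library (e.g. as an initial margin) rather than writing the boosting loop by hand; the sketch only shows why starting from a pretrained prior helps when the trees have few samples to learn from.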

📝 Abstract
Large language models (LLMs) perform remarkably well on tabular datasets in zero- and few-shot settings, since they can extract meaning from natural language column headers that describe features and labels. Similarly, TabPFN, a recent non-LLM transformer pretrained on numerous tables for in-context learning, has demonstrated excellent performance for dataset sizes up to a thousand samples. In contrast, gradient-boosted decision trees (GBDTs) are typically trained from scratch on each dataset without benefiting from pretraining data and must learn the relationships between columns from their entries alone since they lack natural language understanding. LLMs and TabPFN excel on small tabular datasets where a strong prior is essential, yet they are not competitive with GBDTs on medium or large datasets, since their context lengths are limited. In this paper, we propose a simple and lightweight approach for fusing large language models and TabPFN with gradient-boosted decision trees, which allows scalable GBDTs to benefit from the natural language capabilities and pretraining of transformers. We name our fusion methods LLM-Boost and PFN-Boost, respectively. While matching or surpassing the performance of the transformer at sufficiently small dataset sizes and GBDTs at sufficiently large sizes, LLM-Boost and PFN-Boost outperform both standalone components on a wide range of dataset sizes in between. We demonstrate state-of-the-art performance against numerous baselines and ensembling algorithms. We find that PFN-Boost achieves the best average performance among all methods we test for all but very small dataset sizes. We release our code at http://github.com/MayukaJ/LLM-Boost.
Problem

Research questions and friction points this paper is trying to address.

Enhancing decision trees with transformer capabilities
Improving tabular data processing across various sample sizes
Combining LLMs and GBDTs for superior performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuses LLMs with gradient-boosted decision trees
Combines TabPFN transformer with GBDTs
Enhances GBDTs using pretrained transformer capabilities