Ensembling Tabular Foundation Models - A Diversity Ceiling And A Calibration Trap

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study asks whether ensembling can compensate for the fact that no single tabular foundation model (TFM) wins on every dataset, even though individual TFMs are strong. The authors evaluate six ensemble strategies over six modern TFMs on 153 OpenML classification tasks, measuring diversity with the pairwise Q-statistic and comparing methods with Friedman–Nemenyi tests and calibration analysis. The TFM pool turns out to be near-redundant (mean pairwise Q = 0.961), which caps what ensembling can gain: the best ensemble, two-level cascade stacking, improves accuracy by only +0.18% over the strongest single TFM at 253× the compute. Stacking with a logistic-regression meta-learner reaches competitive accuracy but has the worst log-loss rank among the ensembles, a sign of degraded probability calibration. The authors recommend greedy selection as the practical default for TFM ensembling.
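To make the diversity measure concrete, the following is a minimal sketch of the pairwise Q-statistic (Yule's Q) computed from per-sample correctness indicators. It assumes the standard textbook definition, $Q = (N^{11}N^{00} - N^{01}N^{10}) / (N^{11}N^{00} + N^{01}N^{10})$; the function names `q_statistic` and `mean_pairwise_q` are hypothetical and are not taken from the paper's code.

```python
from itertools import combinations

import numpy as np


def q_statistic(correct_a, correct_b):
    """Yule's Q between two classifiers, given boolean per-sample correctness."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    n11 = int(np.sum(a & b))    # both correct
    n00 = int(np.sum(~a & ~b))  # both wrong
    n10 = int(np.sum(a & ~b))   # only the first is correct
    n01 = int(np.sum(~a & b))   # only the second is correct
    denom = n11 * n00 + n01 * n10
    if denom == 0:
        # Undefined when the 2x2 table is degenerate (e.g. no shared errors
        # and no disagreements in one direction).
        return float("nan")
    return (n11 * n00 - n01 * n10) / denom


def mean_pairwise_q(correct_matrix):
    """Mean Q over all model pairs; rows are models, columns are samples."""
    m = np.asarray(correct_matrix, dtype=bool)
    qs = [q_statistic(m[i], m[j]) for i, j in combinations(range(m.shape[0]), 2)]
    return float(np.nanmean(qs))
```

A value near 1 means the two models tend to be right and wrong on the same samples, so averaging them has few errors to cancel; values near 0 indicate roughly independent errors.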

📝 Abstract

Tabular foundation models (TFMs) now match or beat tuned gradient-boosted trees on a growing fraction of tabular tasks, but no single TFM wins on every dataset. Ensembling is the go-to fix, yet it works less well than expected. Six modern TFMs form a near-redundant pool: their mean pairwise Q-statistic is $0.961$, close enough to $1$ that the gain available from any convex combination of their predictions is bounded above. We benchmark six ensemble strategies over six TFMs on 153 OpenML classification tasks. The best ensemble, two-level cascade stacking, buys $+0.18\%$ accuracy over the strongest single TFM at $253\times$ the compute. A Friedman and Nemenyi analysis places three ensembles and the best base TFM in a single equivalence group; the three other ensembles are significantly worse than the best base TFM. Stacking with a logistic-regression meta-learner is the most striking case: it has competitive accuracy and ROC-AUC but the worst log-loss rank among the ensembles. The meta-learner improves accuracy by sharpening class boundaries, which destroys calibration. We recommend greedy selection as the practical default.
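The calibration trap described in the abstract can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not the paper's experiment: it takes hand-picked binary probabilities, sharpens them by dividing the logits by a temperature below 1 (a stand-in for an overconfident meta-learner), and compares accuracy and log-loss before and after. Accuracy is held fixed here to isolate the effect; the data and the helper names `log_loss`, `accuracy`, and `sharpen` are hypothetical.

```python
import numpy as np


def log_loss(y_true, p_pos, eps=1e-12):
    """Mean binary cross-entropy for predicted P(class = 1)."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pos, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def accuracy(y_true, p_pos):
    """Accuracy at a 0.5 decision threshold."""
    pred = np.asarray(p_pos) >= 0.5
    return float(np.mean(pred == np.asarray(y_true, dtype=bool)))


def sharpen(p_pos, temperature):
    """Divide logits by `temperature`; values below 1 push probabilities to 0 or 1."""
    p = np.asarray(p_pos, dtype=float)
    logits = np.log(p) - np.log(1 - p)
    return 1.0 / (1.0 + np.exp(-logits / temperature))


# Toy labels and probabilities (illustrative only); the last two predictions are wrong.
y = np.array([1, 1, 1, 0, 0, 0, 1, 0])
p = np.array([0.8, 0.7, 0.9, 0.2, 0.3, 0.1, 0.4, 0.6])
p_sharp = sharpen(p, temperature=0.25)

print("accuracy:", accuracy(y, p), "->", accuracy(y, p_sharp))
print("log-loss:", log_loss(y, p), "->", log_loss(y, p_sharp))
```

Sharpening never moves a probability across the 0.5 threshold, so accuracy is unchanged, but the two wrong predictions become confidently wrong and the log-loss rises. This is why accuracy and ROC-AUC alone can hide a calibration regression.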

Problem

Research questions and friction points this paper is trying to address.

Tabular Foundation Models

Ensembling

Model Diversity

Calibration

Redundancy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Tabular Foundation Models

Ensemble Diversity

Calibration Trap