🤖 AI Summary
This study addresses the challenge of conditional density estimation (CDE) for tabular data under heteroscedastic, multimodal, or asymmetric uncertainty by providing the first systematic evaluation of tabular foundation models, specifically TabPFN and TabICL, for this task. Using 39 real-world datasets, the authors compare these models against parametric, tree-based, and neural network approaches across six metrics covering density accuracy, calibration, and computational efficiency. Results show that tabular foundation models outperform existing methods on the large majority of datasets, achieving state-of-the-art CDE loss, log-likelihood, and continuous ranked probability score (CRPS). They are also well calibrated in low-data regimes, although calibration can lag behind task-specific neural baselines as sample sizes grow. On the SDSS DR18 photometric redshift prediction task, they surpass specialized baselines using only 50,000 training samples, far fewer than the 500,000 samples on which competing methods were trained, highlighting strong generalization.
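Calibration here refers to how closely the predicted conditional distributions match the observed outcomes. As a concrete illustration (not the paper's exact protocol), below is a minimal sketch of a probability-integral-transform (PIT) calibration check, a standard diagnostic for conditional density estimators; the `model` object and its `predict_cdf(X, y)` method are hypothetical stand-ins for any estimator that can evaluate the conditional CDF.

```python
# Minimal sketch of a PIT-based calibration check (illustrative; not the
# paper's exact protocol). Assumes a hypothetical `model` exposing a
# `predict_cdf(X, y)` method that returns the conditional CDF value
# F(y_i | x_i) for each test point.
import numpy as np

def calibration_error(model, X_test, y_test, n_bins=10):
    # Probability integral transform: these values are uniform on [0, 1]
    # exactly when the predictive distributions are perfectly calibrated.
    pit = model.predict_cdf(X_test, y_test)
    counts, _ = np.histogram(pit, bins=n_bins, range=(0.0, 1.0))
    freqs = counts / len(pit)
    # Mean absolute deviation of the PIT histogram from uniformity;
    # 0 indicates perfect calibration.
    return np.mean(np.abs(freqs - 1.0 / n_bins))
```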
📝 Abstract
Conditional density estimation (CDE), recovering the full conditional distribution of a response given tabular covariates, is essential in settings with heteroscedasticity, multimodality, or asymmetric uncertainty. Recent tabular foundation models such as TabPFN and TabICL naturally produce predictive distributions; while their point-prediction performance is well studied, their effectiveness as general-purpose CDE methods has not been systematically evaluated. We benchmark three tabular foundation model variants against a diverse set of parametric, tree-based, and neural CDE baselines on 39 real-world datasets, across training sizes from 50 to 20,000, using six metrics covering density accuracy, calibration, and computation time. Across all sample sizes, foundation models achieve the best CDE loss, log-likelihood, and CRPS on the large majority of datasets tested. Calibration is competitive at small sample sizes but, for some metrics and datasets, lags behind task-specific neural baselines at larger sample sizes, suggesting that post-hoc recalibration may be a valuable complement. In a photometric redshift case study using SDSS DR18, TabPFN given only 50,000 training galaxies outperforms all baselines trained on the full 500,000-galaxy dataset. Taken together, these results establish tabular foundation models as strong off-the-shelf conditional density estimators.
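For readers less familiar with the metrics above, CRPS scores a whole predictive distribution against a single observed value, so it rewards both accuracy and well-sized uncertainty. Below is a minimal, hedged sketch of the standard Monte Carlo CRPS estimator, CRPS(F, y) ≈ mean|X - y| - 0.5 mean|X - X'| with X, X' drawn from the predictive distribution F; the function name and example setup are illustrative, not taken from the paper.

```python
# Minimal sketch of the Monte Carlo CRPS estimator (illustrative; not the
# paper's implementation). `samples` are draws from one predictive
# distribution F(. | x); `y` is the observed response. Lower is better.
import numpy as np

def crps_from_samples(samples, y):
    samples = np.asarray(samples, dtype=float)
    # CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|, with X, X' ~ F i.i.d.
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

# Example: score 1,000 draws from a N(0, 1) predictive distribution
# against an observed value of 0.3.
rng = np.random.default_rng(0)
print(crps_from_samples(rng.normal(0.0, 1.0, size=1000), 0.3))
```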