Unmasking Trees for Tabular Data

📅 2024-07-08
🏛️ arXiv.org
📈 Citations: 3
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the persistent performance gap between deep generative models and traditional methods in tabular data generation and missing-value imputation, this paper introduces UnmaskingTrees, a framework that employs gradient-boosted trees (GBTs) to incrementally unmask individual features, enabling efficient conditional imputation and generation. The authors further propose BaltoBot, a tree-based probabilistic prediction framework that requires no parametric assumptions on the conditional distribution, natively supports discrete variables and multimodal outputs, and offers closed-form density estimation and fast sampling. Finally, treating both approaches as meta-algorithms, the paper demonstrates in-context learning-based generative modeling by incorporating TabPFN. Experiments show state-of-the-art performance on missing-value imputation, strong results on generation given training data with missingness, and competitive performance on standard tabular generation benchmarks.

๐Ÿ“ Abstract
Despite much work on advanced deep learning and generative modeling techniques for tabular data generation and imputation, traditional methods have continued to win on imputation benchmarks. We herein present UnmaskingTrees, a simple method for tabular imputation (and generation) employing gradient-boosted decision trees which are used to incrementally unmask individual features. This approach offers state-of-the-art performance on imputation, and on generation given training data with missingness; and it has competitive performance on vanilla generation. To solve the conditional generation subproblem, we propose a tabular probabilistic prediction method, BaltoBot, which fits a balanced tree of boosted tree classifiers. Unlike older methods, it requires no parametric assumption on the conditional distribution, accommodating features with multimodal distributions; unlike newer diffusion methods, it offers fast sampling, closed-form density estimation, and flexible handling of discrete variables. We finally consider our two approaches as meta-algorithms, demonstrating in-context learning-based generative modeling with TabPFN.
Problem

Research questions and friction points this paper is trying to address.

Improving tabular data imputation with gradient-boosted trees
Enhancing conditional generation via non-parametric probabilistic prediction
Achieving state-of-the-art performance on missing data generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses gradient-boosted decision trees for imputation
Proposes BaltoBot for conditional generation
Combines in-context learning with TabPFN