Unmasking Trees for Tabular Data

📅 2024-07-08
🏛️ arXiv.org
📈 Citations: 3
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the persistent performance gap between deep generative models and traditional methods in tabular data generation and missing-value imputation, this paper introduces UnmaskingTrees, a framework that employs gradient-boosted trees (GBTs) to incrementally unmask individual features, enabling efficient conditional imputation and generation. The authors further propose BaltoBot, a tree-based probabilistic prediction framework that requires no parametric assumptions on the conditional distribution, natively supports discrete variables and multimodal outputs, and offers closed-form density estimation and fast sampling. Finally, treating both approaches as meta-algorithms, the paper demonstrates in-context learning-based generative modeling by incorporating TabPFN. Experiments show state-of-the-art performance on missing-value imputation, strong results on generation given training data with missingness, and competitive performance on standard tabular generation benchmarks.

๐Ÿ“ Abstract
Despite much work on advanced deep learning and generative modeling techniques for tabular data generation and imputation, traditional methods have continued to win on imputation benchmarks. We herein present UnmaskingTrees, a simple method for tabular imputation (and generation) employing gradient-boosted decision trees which are used to incrementally unmask individual features. This approach offers state-of-the-art performance on imputation, and on generation given training data with missingness; and it has competitive performance on vanilla generation. To solve the conditional generation subproblem, we propose a tabular probabilistic prediction method, BaltoBot, which fits a balanced tree of boosted tree classifiers. Unlike older methods, it requires no parametric assumption on the conditional distribution, accommodating features with multimodal distributions; unlike newer diffusion methods, it offers fast sampling, closed-form density estimation, and flexible handling of discrete variables. We finally consider our two approaches as meta-algorithms, demonstrating in-context learning-based generative modeling with TabPFN.
Problem

Research questions and friction points this paper is trying to address.

Improving tabular data imputation with gradient-boosted trees
Enhancing conditional generation via non-parametric probabilistic prediction
Achieving state-of-the-art performance on missing data generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses gradient-boosted decision trees for imputation
Proposes BaltoBot for conditional generation
Combines in-context learning with TabPFN