🤖 AI Summary
This work addresses the trade-off between fairness and utility in synthetic tabular data generation. Methodologically, it (1) employs nonparametric decision trees to model complex dependencies among mixed-type features, eliminating distributional assumptions and manual preprocessing; (2) introduces a soft leaf resampling mechanism that directly mitigates bias correlated with sensitive attributes during generation; and (3) yields a CPU-efficient framework enabling end-to-end fair synthesis. Evaluated on multiple benchmark fairness datasets, the approach achieves an average 72% speedup over state-of-the-art deep generative models, synthesizing medium-scale fair tabular data within one second. It simultaneously improves fairness, reducing statistical parity difference by 41%, and preserves utility, maintaining ≥98% of the original data’s F1 score on downstream ML tasks. To the authors’ knowledge, this is the first framework to achieve efficient, general-purpose, and interpretable fair tabular data synthesis.
📝 Abstract
Ensuring fairness in machine learning remains a significant challenge, as models often inherit biases from their training data. Generative models have recently emerged as a promising approach to mitigate bias at the data level while preserving utility. However, many rely on deep architectures, despite evidence that simpler models can be highly effective for tabular data. In this work, we introduce TABFAIRGDT, a novel method for generating fair synthetic tabular data using autoregressive decision trees. To enforce fairness, we propose a soft leaf resampling technique that adjusts decision tree outputs to reduce bias while preserving predictive performance. Our approach is non-parametric and effectively captures complex relationships between mixed feature types, without relying on assumptions about the underlying data distributions. We evaluate TABFAIRGDT on benchmark fairness datasets and demonstrate that it outperforms state-of-the-art (SOTA) deep generative models, achieving a better fairness-utility trade-off for downstream tasks as well as higher synthetic data quality. Moreover, our method is lightweight, highly efficient, and CPU-compatible, requiring no data pre-processing. Remarkably, TABFAIRGDT achieves a 72% average speedup over the fastest SOTA baseline across various dataset sizes, and can generate fair synthetic data for a medium-sized dataset (10 features, 10K samples) in just one second on a standard CPU, making it an ideal solution for real-world fairness-sensitive applications.
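The core idea, a decision tree whose leaf distributions are softened toward a sensitive-attribute-free marginal before sampling, can be sketched as follows. This is an illustrative toy, not the paper's implementation: the function name `soft_leaf_sample`, the blending parameter `lam`, and the mixing-with-the-marginal rule are all assumptions made for the example; the actual TABFAIRGDT resampling scheme may differ.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy data: binary sensitive attribute S and a label Y correlated with it.
n = 2000
S = rng.integers(0, 2, n)
Y = (rng.random(n) < np.where(S == 1, 0.7, 0.3)).astype(int)

# One autoregressive step: fit a shallow tree modeling P(Y | S).
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(S.reshape(-1, 1), Y)

def soft_leaf_sample(tree, X, lam, marginal, rng):
    """Sample labels from leaf distributions blended with the marginal.

    lam (hypothetical fairness strength in [0, 1]) interpolates between
    the raw leaf class frequencies (lam=0) and the S-independent
    marginal distribution of Y (lam=1)."""
    leaf_proba = tree.predict_proba(X)              # per-row leaf class frequencies
    soft = (1 - lam) * leaf_proba + lam * marginal  # softened distribution
    return np.array([rng.choice(len(marginal), p=p) for p in soft])

marginal = np.bincount(Y, minlength=2) / n
S_new = rng.integers(0, 2, n).reshape(-1, 1)

y_raw = soft_leaf_sample(tree, S_new, 0.0, marginal, rng)   # no fairness adjustment
y_fair = soft_leaf_sample(tree, S_new, 1.0, marginal, rng)  # full softening

def spd(s, y):
    # Statistical parity difference: |P(Y=1 | S=1) - P(Y=1 | S=0)|
    return abs(y[s == 1].mean() - y[s == 0].mean())

print(spd(S_new.ravel(), y_raw), spd(S_new.ravel(), y_fair))
```

With `lam=0` the synthetic labels reproduce the original dependence on S (SPD near 0.4 here), while `lam=1` samples Y independently of S, driving SPD toward zero; intermediate values trade fairness against fidelity, which is the trade-off the soft resampling mechanism controls.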