🤖 AI Summary
This work addresses the trade-off between fairness and utility in synthetic tabular data generation. Methodologically, it (1) employs nonparametric decision trees to model complex dependencies among mixed-type features, eliminating distributional assumptions and manual preprocessing; (2) introduces a soft leaf resampling mechanism that directly mitigates bias correlated with sensitive attributes during generation; and (3) yields a CPU-efficient framework enabling end-to-end fair synthesis. Evaluated on multiple benchmark fairness datasets, the approach achieves an average 72% speedup over state-of-the-art deep generative models, synthesizing medium-scale fair tabular data within one second. It simultaneously improves fairness, reducing statistical parity difference by 41%, and preserves utility, maintaining ≥98% of the original data’s F1 score on downstream ML tasks. To the authors’ knowledge, this is the first framework to achieve efficient, general-purpose, and interpretable fair tabular data synthesis.
📝 Abstract
Ensuring fairness in machine learning remains a significant challenge, as models often inherit biases from their training data. Generative models have recently emerged as a promising approach to mitigate bias at the data level while preserving utility. However, many rely on deep architectures, despite evidence that simpler models can be highly effective for tabular data. In this work, we introduce TABFAIRGDT, a novel method for generating fair synthetic tabular data using autoregressive decision trees. To enforce fairness, we propose a soft leaf resampling technique that adjusts decision tree outputs to reduce bias while preserving predictive performance. Our approach is non-parametric and effectively captures complex relationships between mixed feature types, without relying on assumptions about the underlying data distributions. We evaluate TABFAIRGDT on benchmark fairness datasets and demonstrate that it outperforms state-of-the-art (SOTA) deep generative models, achieving a better fairness-utility trade-off for downstream tasks as well as higher synthetic data quality. Moreover, our method is lightweight, highly efficient, and CPU-compatible, requiring no data pre-processing. Remarkably, TABFAIRGDT achieves a 72% average speedup over the fastest SOTA baseline across various dataset sizes, and can generate fair synthetic data for a medium-sized dataset (10 features, 10K samples) in just one second on a standard CPU, making it an ideal solution for real-world fairness-sensitive applications.
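The core idea, a decision tree whose leaf distributions are softened toward a sensitive-attribute-free marginal before sampling, can be sketched as follows. This is an illustrative toy, not the paper's implementation: the function name `soft_leaf_sample`, the blending parameter `lam`, and the mixing-with-the-marginal rule are all assumptions made for the example; the actual TABFAIRGDT resampling scheme may differ.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy data: binary sensitive attribute S and a label Y correlated with it.
n = 2000
S = rng.integers(0, 2, n)
Y = (rng.random(n) < np.where(S == 1, 0.7, 0.3)).astype(int)

# One autoregressive step: fit a shallow tree modeling P(Y | S).
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(S.reshape(-1, 1), Y)

def soft_leaf_sample(tree, X, lam, marginal, rng):
    """Sample labels from leaf distributions blended with the marginal.

    lam (hypothetical fairness strength in [0, 1]) interpolates between
    the raw leaf class frequencies (lam=0) and the S-independent
    marginal distribution of Y (lam=1)."""
    leaf_proba = tree.predict_proba(X)              # per-row leaf class frequencies
    soft = (1 - lam) * leaf_proba + lam * marginal  # softened distribution
    return np.array([rng.choice(len(marginal), p=p) for p in soft])

marginal = np.bincount(Y, minlength=2) / n
S_new = rng.integers(0, 2, n).reshape(-1, 1)

y_raw = soft_leaf_sample(tree, S_new, 0.0, marginal, rng)   # no fairness adjustment
y_fair = soft_leaf_sample(tree, S_new, 1.0, marginal, rng)  # full softening

def spd(s, y):
    # Statistical parity difference: |P(Y=1 | S=1) - P(Y=1 | S=0)|
    return abs(y[s == 1].mean() - y[s == 0].mean())

print(spd(S_new.ravel(), y_raw), spd(S_new.ravel(), y_fair))
```

With `lam=0` the synthetic labels reproduce the original dependence on S (SPD near 0.4 here), while `lam=1` samples Y independently of S, driving SPD toward zero; intermediate values trade fairness against fidelity, which is the trade-off the soft resampling mechanism controls.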