🤖 AI Summary
This paper addresses how feature discretization (binning) can exacerbate group bias and degrade fairness in downstream tasks. We formally define the “unbiased binning” problem for the first time and propose an ε-biased binning framework, built on a set of boundary candidates, that tolerates a small, controllable bias in order to trade off fairness against binning quality. Methodologically, we design an efficient solver that combines dynamic programming, local search, and a divide-and-conquer strategy, the last of which yields near-optimal solutions in near-linear time. Experiments demonstrate that our approach significantly reduces inter-group distributional disparities and improves fairness across classification and regression tasks (e.g., up to 42% reduction in equal opportunity difference), while scaling well: the local search component remains effective even on million-scale datasets.
📝 Abstract
Discretizing raw features into bucketized attribute representations is a popular step before sharing a dataset. However, this step can introduce significant bias into the data and amplify unfairness in downstream tasks.
In this paper, we address this issue by introducing the unbiased binning problem: given an attribute to bucketize, find the discretization closest to equal-size binning that satisfies group parity across buckets. We define a small set of boundary candidates and prove that any unbiased binning must select its boundaries from this set. We then develop an efficient dynamic programming algorithm over the boundary candidates to solve the unbiased binning problem.
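The idea of boundary candidates can be illustrated with a minimal Python sketch. It is not the paper's algorithm. It assumes that group parity means every bucket has exactly the overall group mix, in which case every prefix ending at a boundary must have that mix too, so those prefix lengths serve as candidates. The function names, the assumption of distinct attribute values, and the distance to equal-size binning (largest deviation of a bucket size from n/k) are all hypothetical choices made for illustration.

```python
def boundary_candidates(groups):
    """Prefix lengths i (1..n) whose group mix equals the overall mix.

    `groups` holds one group label per row, with rows already sorted by the
    attribute being bucketized (assumed to have distinct values).
    """
    n = len(groups)
    labels = sorted(set(groups))
    total = {g: groups.count(g) for g in labels}
    seen = {g: 0 for g in labels}
    candidates = []
    for i, g in enumerate(groups, start=1):
        seen[g] += 1
        # Integer test of seen[h] / i == total[h] / n, avoiding float error.
        if all(seen[h] * n == total[h] * i for h in labels):
            candidates.append(i)
    return candidates


def unbiased_binning(groups, k):
    """Pick k bucket end positions from the candidates, or None if impossible.

    Assumed objective: minimize the largest |bucket size - n/k|.
    """
    n = len(groups)
    cands = [0] + boundary_candidates(groups)  # 0 marks the start of the data
    m = len(cands)
    target = n / k
    inf = float("inf")
    # best[j][t]: smallest achievable cost with j buckets ending at cands[t].
    best = [[inf] * m for _ in range(k + 1)]
    prev = [[None] * m for _ in range(k + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for t in range(1, m):
            for s in range(t):
                if best[j - 1][s] == inf:
                    continue
                size = cands[t] - cands[s]
                cost = max(best[j - 1][s], abs(size - target))
                if cost < best[j][t]:
                    best[j][t] = cost
                    prev[j][t] = s
    if best[k][m - 1] == inf:
        return None  # no unbiased binning with k buckets exists
    # Walk the back-pointers from the final candidate (position n).
    bounds, t = [], m - 1
    for j in range(k, 0, -1):
        bounds.append(cands[t])
        t = prev[j][t]
    return bounds[::-1]
```

Because the difference of two prefixes with the overall mix also has the overall mix, any pair of candidates delimits a valid bucket, which is why the search can be restricted to the candidate list rather than all n positions. A `None` result corresponds to the case where no unbiased binning exists for the requested number of buckets.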
An unbiased binning may come at a high price of fairness, or may not exist at all, especially when group values follow different distributions. Since a small bias in the group ratios may be tolerable in such settings, we introduce the epsilon-biased binning problem, which bounds the group disparities across buckets by a small value epsilon. We first develop a dynamic programming solution, DP, that finds the optimal binning in quadratic time. Although polynomial, the DP algorithm does not scale to very large settings. We therefore propose a practically scalable algorithm for epsilon-biased binning based on local search (LS). The key component of the LS algorithm is a divide-and-conquer (D&C) algorithm that finds a near-optimal solution in near-linear time. We prove that D&C finds a valid solution unless none exists. The LS algorithm then starts a local search, using the D&C solution as an upper bound, to find the optimal solution.
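The epsilon-biased condition can be made concrete with a small Python check. The abstract does not state the exact disparity measure, so this sketch assumes one: the largest absolute difference between a group's share inside a bucket and its share in the whole dataset. The function names and the convention that `boundaries` lists strictly increasing bucket end positions ending at n are likewise assumptions for illustration.

```python
from fractions import Fraction


def max_group_disparity(groups, boundaries):
    """Largest |share of a group in a bucket - overall share of that group|.

    `groups` holds one group label per row, sorted by the bucketized attribute.
    `boundaries` holds bucket end positions, strictly increasing, ending at n.
    """
    n = len(groups)
    labels = set(groups)
    overall = {g: Fraction(groups.count(g), n) for g in labels}
    worst = Fraction(0)
    start = 0
    for end in boundaries:
        bucket = groups[start:end]
        for g in labels:
            share = Fraction(bucket.count(g), len(bucket))
            worst = max(worst, abs(share - overall[g]))
        start = end
    return worst


def is_eps_biased(groups, boundaries, eps):
    """True if every bucket stays within eps of the overall group mix."""
    return max_group_disparity(groups, boundaries) <= eps
```

Under this assumed measure, an unbiased binning is the special case where the disparity is zero, and plain equal-size binning can be tested against a chosen epsilon before deciding whether any boundary needs to move.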