🤖 AI Summary
Addressing the fundamental "bias in, bias out" issue in fair machine learning, this work tackles the challenge of achieving exact group fairness without compromising predictive utility.
Method: We formally define the concept of an “ideal distribution” and propose a rigorous optimization framework based on KL-divergence minimization, provably guaranteeing group fairness. The framework integrates cost-sensitive risk minimization with affine transformations, enabling exact fairness (e.g., demographic parity, equalized odds) in both generative models and large language model representation spaces—without utility loss. It supports multiple parametric distributions and ensures computational tractability.
Contribution/Results: Evaluated on synthetic benchmarks and the Bios occupation prediction task, our approach significantly improves fairness metrics (e.g., disparity reduction by up to 87%) while preserving or even enhancing predictive accuracy—thereby breaking the conventional fairness–utility trade-off.
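The affine-steering idea can be illustrated with a small sketch: map each group's empirical feature distribution onto a common target Gaussian by matching the first two moments, so that any downstream score sees identical group-conditional inputs. This is our own minimal illustration under a Gaussian assumption (using the symmetric-square-root map); the paper's actual optimization program additionally encodes the ideal-distribution fairness constraints.

```python
import numpy as np

def sym_sqrt(S, inv=False):
    # Symmetric (principal) square root of a PSD matrix via eigendecomposition.
    w, V = np.linalg.eigh(S)
    w = np.clip(w, 1e-12, None)
    d = w**-0.5 if inv else w**0.5
    return (V * d) @ V.T

def steer_group(X, mu_t, Sigma_t):
    """Affine map sending the group's empirical N(mu_g, Sigma_g) to N(mu_t, Sigma_t)."""
    mu_g = X.mean(axis=0)
    Sigma_g = np.cov(X, rowvar=False)
    A = sym_sqrt(Sigma_t) @ sym_sqrt(Sigma_g, inv=True)
    return (X - mu_g) @ A.T + mu_t

# Toy data: two groups with different means and covariances.
rng = np.random.default_rng(0)
Xa = rng.normal([0.0, 0.0], [1.0, 1.0], size=(5000, 2))
Xb = rng.normal([2.0, -1.0], [2.0, 0.5], size=(5000, 2))

# Target: pooled (mixture) mean and covariance of both groups.
X_all = np.vstack([Xa, Xb])
mu_t, Sigma_t = X_all.mean(axis=0), np.cov(X_all, rowvar=False)

Ya = steer_group(Xa, mu_t, Sigma_t)
Yb = steer_group(Xb, mu_t, Sigma_t)
# After steering, both groups share the same first two moments,
# so any linear score has matched group-conditional distributions.
print(np.allclose(Ya.mean(0), Yb.mean(0)))  # True
```

By construction the steered groups match the target mean exactly and the target covariance up to numerical precision; matching higher moments would require the stronger parametric assumptions discussed in the paper.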
📝 Abstract
To fix the 'bias in, bias out' problem in fair machine learning, it is important to steer feature distributions of data or internal representations of Large Language Models (LLMs) to ideal ones that guarantee group-fair outcomes. Previous work on fair generative models and representation steering could greatly benefit from provable fairness guarantees on the model output. We define a distribution as ideal if the minimizer of any cost-sensitive risk on it is guaranteed to have exact group-fair outcomes (e.g., demographic parity, equal opportunity); in other words, it has no fairness-utility trade-off. We formulate an optimization program for optimal steering by finding the nearest ideal distribution in KL-divergence, and provide efficient algorithms for it when the underlying distributions come from well-known parametric families (e.g., normal, log-normal). Empirically, our optimal steering techniques on both synthetic and real-world datasets improve fairness without diminishing utility (and sometimes even improve utility). We demonstrate affine steering of LLM representations to reduce bias in multi-class classification, e.g., occupation prediction from a short biography in the Bios dataset (De-Arteaga et al.). Furthermore, we steer internal representations of LLMs towards desired outputs so that the steering works equally well across different groups.
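For intuition on why the parametric case is tractable: between Gaussians, KL divergence has a closed form, and the unconstrained minimizer of a weighted sum of KL terms over Gaussian candidates is obtained by moment-matching the mixture (the M-projection). The sketch below is our own hypothetical illustration of that projection step alone; the paper's full program further restricts the search to ideal (group-fair) distributions.

```python
import numpy as np

def kl_gauss(mu0, S0, mu1, S1):
    """Closed-form KL( N(mu0, S0) || N(mu1, S1) )."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff
                  - d + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

# Two group-conditional Gaussians with group weights.
mus = [np.array([0.0, 0.0]), np.array([2.0, -1.0])]
Ss  = [np.eye(2), np.diag([2.0, 0.5])]
w   = [0.5, 0.5]

# M-projection: the Gaussian Q minimizing sum_g w_g * KL(P_g || Q)
# matches the mean and covariance of the mixture sum_g w_g * P_g.
mu_q = sum(wi * mi for wi, mi in zip(w, mus))
S_q  = sum(wi * (Si + np.outer(mi - mu_q, mi - mu_q))
           for wi, mi, Si in zip(w, mus, Ss))

def objective(mu, S):
    return sum(wi * kl_gauss(mi, Si, mu, S)
               for wi, mi, Si in zip(w, mus, Ss))

# The moment-matched Q beats either group's own distribution as a candidate.
print(objective(mu_q, S_q) <= min(objective(mus[0], Ss[0]),
                                  objective(mus[1], Ss[1])))  # True
```

The paper's steering program adds fairness (ideal-distribution) constraints on top of this KL objective, which is what makes the projected distribution guarantee group-fair minimizers rather than merely being close in KL.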