🤖 AI Summary
Machine learning models often amplify societal biases present in their training data, disproportionately harming minority groups. This paper addresses fairness risks at the earliest stage of the ML pipeline, data collection, and introduces DispaRisk, a fairness risk auditing framework grounded in usable information theory. The method quantifies disparity risk by integrating statistical bias metrics with cross-dataset benchmarking to jointly achieve three objectives: (i) data-level risk prediction, (ii) identification of model families sensitive to bias, and (iii) interpretable attribution of risk sources. Experiments on standard fairness benchmarks demonstrate that the framework identifies high-risk datasets and bias-prone model families, improving both early detection and the interpretability of bias risks. By enabling proactive, data-centric fairness assessment, it provides a deployable, pre-modeling audit tool for fair ML development.
📝 Abstract
Machine learning (ML) algorithms impact virtually every aspect of human life and are used across diverse sectors, including healthcare, finance, and education. ML algorithms have often been found to exacerbate societal biases present in datasets, leading to adverse impacts on subsets or groups of individuals, and in many cases on minority groups. To effectively mitigate these harms, it is crucial that disparities and biases are identified early in an ML pipeline. This proactive approach enables timely interventions that prevent bias amplification and reduce complexity at later stages of model development. In this paper, we leverage recent advancements in usable information theory to introduce DispaRisk, a novel framework designed to proactively assess the potential risks of disparities in datasets during the initial stages of the ML pipeline. We evaluate DispaRisk's effectiveness by benchmarking it against datasets commonly used in fairness research. Our findings demonstrate DispaRisk's ability to identify datasets with a high risk of discrimination, detect model families prone to bias within an ML pipeline, and enhance the explainability of these bias risks. This work contributes to the development of fairer ML systems by providing a robust tool for early bias detection and mitigation.
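To give a concrete sense of the data-level, pre-modeling auditing the abstract describes, the sketch below computes a simple statistical bias metric (demographic parity difference) directly on a labeled dataset, before any model is trained. This is only an illustrative example of such a metric, not DispaRisk's actual algorithm; the function name and toy data are invented for this sketch.

```python
# Hypothetical sketch (not DispaRisk's method): a pre-modeling bias check
# that measures the gap in positive-label rates between two groups.

def demographic_parity_difference(labels, groups):
    """Absolute gap in positive-label rates between groups "a" and "b".

    labels: parallel iterable of 0/1 outcomes.
    groups: parallel iterable of group ids ("a" or "b").
    A large gap flags the dataset as higher risk for downstream bias.
    """
    rates = {}
    for g in ("a", "b"):
        outcomes = [y for y, grp in zip(labels, groups) if grp == g]
        rates[g] = sum(outcomes) / len(outcomes)
    return abs(rates["a"] - rates["b"])

# Toy audit: group "a" receives positive labels far more often than "b".
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
groups = ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"]
print(demographic_parity_difference(labels, groups))  # 0.8
```

A score of 0 would indicate identical positive-label rates across groups; here the 0.8 gap would surface this toy dataset as high-risk before any model development begins, which is the kind of early intervention the pipeline-stage argument above motivates.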