Underrepresentation, Label Bias, and Proxies: Towards Data Bias Profiles for the EU AI Act and Beyond

📅 2025-07-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Data bias—arising from underrepresentation, label bias, and proxy variables—is a primary driver of algorithmic discrimination, yet there is no unified framework for systematically identifying and quantifying such biases. This paper introduces the Data Bias Profile (DBP): a data-centric construct that jointly documents multiple bias signals and supports predictive assessment of discrimination risk. The DBP combines dedicated detection mechanisms for each bias type with multi-granularity bias metrics and standard fairness benchmark datasets. Empirical analysis reveals that the combination of proxy variables and label bias induces stronger discriminatory outcomes than underrepresentation of vulnerable populations alone. Experiments across multiple fairness benchmarks show that the DBP predicts model-level discrimination risk and guides targeted fairness-enhancing interventions, bridging algorithmic fairness research and real-world anti-discrimination policy such as the EU AI Act.

📝 Abstract
Undesirable biases encoded in the data are key drivers of algorithmic discrimination. Their importance is widely recognized in the algorithmic fairness literature, as well as legislation and standards on anti-discrimination in AI. Despite this recognition, data biases remain understudied, hindering the development of computational best practices for their detection and mitigation. In this work, we present three common data biases and study their individual and joint effect on algorithmic discrimination across a variety of datasets, models, and fairness measures. We find that underrepresentation of vulnerable populations in training sets is less conducive to discrimination than conventionally affirmed, while combinations of proxies and label bias can be far more critical. Consequently, we develop dedicated mechanisms to detect specific types of bias, and combine them into a preliminary construct we refer to as the Data Bias Profile (DBP). This initial formulation serves as a proof of concept for how different bias signals can be systematically documented. Through a case study with popular fairness datasets, we demonstrate the effectiveness of the DBP in predicting the risk of discriminatory outcomes and the utility of fairness-enhancing interventions. Overall, this article bridges algorithmic fairness research and anti-discrimination policy through a data-centric lens.
Problem

Research questions and friction points this paper is trying to address.

Study underrepresentation and label bias in AI datasets
Analyze joint effects of data biases on discrimination
Develop Data Bias Profile to detect bias systematically
Innovation

Methods, ideas, or system contributions that make the work stand out.

Develops mechanisms to detect specific data biases
Combines bias signals into Data Bias Profile (DBP)
Uses DBP to predict the risk of discriminatory outcomes
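The last point above — using a profile to predict discrimination risk — could be sketched as a simple decision rule over profile values. The rule and thresholds below are entirely hypothetical; they only mirror the paper's finding that co-occurring proxies and label bias are more critical than underrepresentation alone:

```python
# Hypothetical risk rule over a DBP-style dict (not the paper's predictor).
def discrimination_risk(dbp, proxy_thr=0.5, label_thr=0.1):
    """Flag high risk when strong proxies and label bias co-occur;
    underrepresentation alone does not trigger a flag (thresholds invented)."""
    strong_proxy = dbp["proxy_strength"] > proxy_thr
    biased_labels = dbp["label_bias_gap"] > label_thr
    if strong_proxy and biased_labels:
        return "high"
    if strong_proxy or biased_labels:
        return "medium"
    return "low"

print(discrimination_risk(
    {"underrepresentation": 0.3, "label_bias_gap": 0.38, "proxy_strength": 0.95}
))  # -> high
```

A rule like this could then gate which fairness-enhancing intervention to apply, e.g. relabeling when label bias dominates versus feature auditing when proxies dominate.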