Are you sure? Measuring models bias in content moderation through uncertainty

📅 2025-09-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses implicit racial and sociodemographic biases, particularly against women and non-White individuals, in language models used for automated content moderation. Recognizing that conventional metrics (e.g., accuracy, F1 score) fail to expose systemic representational biases, the authors introduce conformal prediction as a framework for fairness evaluation. Their method quantifies prediction uncertainty across socially stratified annotation groups to detect underrepresentation and miscalibration in pre-trained models. Experiments span 11 state-of-the-art language models evaluated on annotations from women and non-White annotators. The results reveal that high-accuracy models frequently exhibit markedly lower predictive confidence on minority-group data, a strong indicator of structural bias in their internal representations. This uncertainty-aware fairness assessment provides an interpretable, scalable, and theoretically grounded approach for diagnosing and mitigating bias in large language models.

📝 Abstract
Automatic content moderation is crucial to ensuring safety on social media. Language Model-based classifiers are increasingly adopted for this task, but they have been shown to perpetuate racial and social biases. Although several resources and benchmark corpora have been developed to address this problem, measuring the fairness of models in content moderation remains an open issue. In this work, we present an unsupervised approach that benchmarks models on the basis of their uncertainty in classifying messages annotated by people belonging to vulnerable groups. We use uncertainty, computed by means of the conformal prediction technique, as a proxy to analyze the bias of 11 models against women and non-white annotators, and observe to what extent it diverges from performance-based metrics such as the $F_1$ score. The results show that some pre-trained models predict the labels coming from minority groups with high accuracy, even when the confidence in their predictions is low. By measuring model confidence, we can therefore see which groups of annotators are better represented in pre-trained models and guide the debiasing of these models before their deployment.
Problem

Research questions and friction points this paper is trying to address.

Measuring bias in content moderation models through uncertainty analysis
Evaluating model fairness against women and non-white annotators
Using confidence metrics to identify underrepresented groups in models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised bias measurement using model uncertainty
Conformal prediction technique for confidence analysis
Benchmarking models against vulnerable group annotations
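To make the innovation above concrete, here is a minimal sketch of split conformal prediction for a classifier, the technique the paper uses to quantify uncertainty. This is not the authors' implementation: the synthetic softmax outputs, the three-class setup, and the `alpha` level are all hypothetical. The key idea it illustrates is that a larger average prediction-set size on one annotator group's data signals lower model confidence, and hence potential bias, toward that group.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: softmax outputs of a 3-class moderation classifier.
n_cal, n_classes = 200, 3
cal_probs = rng.dirichlet(alpha=[5, 1, 1], size=n_cal)   # calibration outputs
cal_labels = rng.integers(0, n_classes, size=n_cal)      # calibration labels

# Nonconformity score: 1 minus the probability assigned to the true label.
scores = 1.0 - cal_probs[np.arange(n_cal), cal_labels]

alpha = 0.1  # target miscoverage level (90% coverage)
# Finite-sample-corrected quantile of the calibration scores.
q_level = np.ceil((n_cal + 1) * (1 - alpha)) / n_cal
qhat = np.quantile(scores, min(q_level, 1.0), method="higher")

def prediction_set(probs):
    """Conformal prediction set: all classes whose score is below qhat."""
    return np.where(1.0 - probs <= qhat)[0]

# Proxy for group-level uncertainty: average set size on that group's data.
# Comparing this quantity across annotator groups is the paper's core idea.
group_probs = rng.dirichlet(alpha=[5, 1, 1], size=50)
avg_set_size = np.mean([len(prediction_set(p)) for p in group_probs])
print(avg_set_size)
```

Computed separately on messages labeled by each annotator group (e.g., women vs. men, white vs. non-white annotators), a larger average set size for one group indicates the model is less certain about that group's labels, even when point-prediction accuracy looks comparable.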