Are Bias Evaluation Methods Biased?

📅 2025-06-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Are existing bias evaluation benchmarks reliable? This paper systematically investigates the robustness of mainstream large language model (LLM) bias evaluation methods—including BOLD, CrowS-Pairs, and StereoSet—through cross-benchmark assessment and ranking consistency analysis using Kendall’s τ and Spearman’s ρ. Our first empirical finding reveals severe inconsistency in model rankings across benchmarks, with an average Kendall’s τ of only 0.32. This indicates that current benchmarks lack cross-method comparability, rendering individual evaluation scores insufficient for reliable safety-oriented model comparison. Crucially, we identify the evaluation methodologies themselves as a systematic source of bias, challenging the validity and reliability of prevailing benchmarks. Based on these findings, we propose community-wide evaluation guidelines advocating a paradigm shift—from “method-centric” assessment toward “robustness-first” bias evaluation—thereby advancing methodological rigor and interpretability in LLM fairness research.
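The ranking-consistency analysis described above can be sketched in a few lines: score the same set of models under two benchmarks and compare the induced rankings with Kendall's τ and Spearman's ρ. The model names and scores below are purely illustrative, not values from the paper.

```python
# Sketch of a cross-benchmark ranking-consistency check.
# Scores are hypothetical placeholders, not results from the paper.
from scipy.stats import kendalltau, spearmanr

models = ["model-a", "model-b", "model-c", "model-d", "model-e"]
bold_scores = [0.12, 0.45, 0.33, 0.20, 0.51]       # hypothetical benchmark 1
stereoset_scores = [0.30, 0.41, 0.15, 0.48, 0.22]  # hypothetical benchmark 2

# Rank correlation between the two score vectors: values near 1 mean the
# benchmarks order the models consistently; values near 0 or below mean
# the rankings disagree, which is the failure mode the paper reports.
tau, _ = kendalltau(bold_scores, stereoset_scores)
rho, _ = spearmanr(bold_scores, stereoset_scores)
print(f"Kendall tau = {tau:.2f}, Spearman rho = {rho:.2f}")
```

With these made-up scores the two benchmarks actually rank the models in opposing orders (τ = -0.20, ρ = -0.30), illustrating how a model judged "safer" under one benchmark can rank worse under another.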

📝 Abstract
The creation of benchmarks to evaluate the safety of Large Language Models is one of the key activities within the trusted AI community. These benchmarks allow models to be compared on different aspects of safety, such as toxicity, bias, and harmful behavior. Independent benchmarks adopt different approaches with distinct data sets and evaluation methods. We investigate how robust such benchmarks are by using different approaches to rank a set of representative models for bias and comparing how similar the overall rankings are. We show that different but widely used bias evaluation methods result in disparate model rankings. We conclude with recommendations for the community on the usage of such benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Evaluating robustness of bias benchmarks for LLMs
Comparing model rankings across different bias methods
Assessing consistency in widely used bias evaluations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compare model rankings using diverse bias evaluations
Investigate robustness of independent benchmark approaches
Recommend improved usage of bias evaluation benchmarks
Lina Berrayana
IBM Research Europe – Zurich Lab, Switzerland; École Polytechnique Fédérale de Lausanne, Switzerland
Sean Rooney
IBM Research Europe – Zurich Lab, Switzerland
Luis Garcés-Erice
IBM Research Europe – Zurich Lab, Switzerland
Ioana Giurgiu
IBM Zurich