🤖 AI Summary
Existing gender fairness metrics for Information Retrieval rely on lexical- and frequency-based measures and can miss subtle gender disparities in passage ranking. To address this, the authors propose an LLM-based, fine-grained gender bias detection framework for passage ranking. The work introduces (1) Class-wise Weighted Exposure (CWEx), a novel fairness metric that explicitly models exposure disparity across gender classes; (2) MSMGenderBias, a new gender bias annotation collection built on a subset of the MS MARCO Passage Ranking dataset; and (3) an evaluation that combines LLM-driven bias detection with human annotation. Experiments on various ranking models show that CWEx offers a more detailed fairness evaluation than previous metrics, with improved alignment to human labels (58.77% Cohen's Kappa agreement on Grep-BiasIR and 18.51% on MSMGenderBias), enhancing both the accuracy and interpretability of fairness assessment in IR systems.
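The summary above describes CWEx as an exposure-based fairness metric. The paper's exact formulation is not reproduced here, but exposure-based ranking fairness commonly assigns each ranked item a log-discounted position weight and aggregates it per group. A minimal sketch of that underlying notion (function names and the discount choice are illustrative assumptions, not the authors' definition):

```python
import math

def exposure(rank):
    """Position-based exposure discount (1-indexed rank), a common choice
    in ranking-fairness work; not necessarily the exact CWEx discount."""
    return 1.0 / math.log2(rank + 1)

def class_exposure(ranking, doc_class):
    """Total exposure each class receives in a ranked list.

    ranking:   list of doc ids in rank order
    doc_class: mapping doc id -> class label (e.g. 'F', 'M', 'neutral')
    """
    totals = {}
    for rank, doc in enumerate(ranking, start=1):
        cls = doc_class[doc]
        totals[cls] = totals.get(cls, 0.0) + exposure(rank)
    return totals

# Example: top-ranked documents contribute more exposure to their class.
totals = class_exposure(["d1", "d2", "d3"], {"d1": "F", "d2": "M", "d3": "F"})
```

A class-weighted metric like CWEx would then compare such per-class exposure totals across gender-related query categories to quantify disparity.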
📝 Abstract
The presence of social biases in Natural Language Processing (NLP) and Information Retrieval (IR) systems is an ongoing challenge, which underlines the importance of developing robust approaches to identifying and evaluating such biases. In this paper, we aim to address this issue by leveraging Large Language Models (LLMs) to detect and measure gender bias in passage ranking. Existing gender fairness metrics rely on lexical- and frequency-based measures, leading to various limitations, e.g., missing subtle gender disparities. Building on our LLM-based gender bias detection method, we introduce a novel gender fairness metric, named Class-wise Weighted Exposure (CWEx), aiming to address existing limitations. To measure the effectiveness of our proposed metric and study LLMs' effectiveness in detecting gender bias, we annotate a subset of the MS MARCO Passage Ranking collection and release our new gender bias collection, called MSMGenderBias, to foster future research in this area. Our extensive experimental results on various ranking models show that our proposed metric offers a more detailed evaluation of fairness compared to previous metrics, with improved alignment to human labels (58.77% for Grep-BiasIR, and 18.51% for MSMGenderBias, measured using Cohen's Kappa agreement), effectively distinguishing gender bias in ranking. By integrating LLM-driven bias detection, an improved fairness metric, and gender bias annotations for an established dataset, this work provides a more robust framework for analyzing and mitigating bias in IR systems.
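The abstract reports alignment with human labels using Cohen's Kappa agreement. For reference, Kappa corrects raw agreement for the agreement expected by chance; a self-contained sketch of the standard computation (label values here are illustrative):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' label sequences:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is chance agreement from the marginal label frequencies."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[l] * counts_b[l]
              for l in set(labels_a) | set(labels_b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Example: comparing hypothetical LLM bias labels against human labels.
human = ["biased", "biased", "neutral", "neutral"]
model = ["biased", "biased", "neutral", "biased"]
kappa = cohen_kappa(human, model)
```

Values closer to 1 indicate stronger chance-corrected agreement, which is how the reported 58.77% and 18.51% improvements over prior metrics should be read.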