A database to support the evaluation of gender biases in GPT-4o output

📅 2025-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of systematically evaluating implicit and structural gender bias in large language model (LLM) outputs—particularly in models like GPT-4o—where conventional fairness metrics lack sensitivity and interpretability. Methodologically, we propose a multidimensional, reproducible fairness assessment paradigm integrating human annotation, adversarial prompt engineering, and cross-contextual template generation. Our structured corpus is annotated along three core dimensions: occupation–gender associations, pronoun consistency, and sociocultural role stereotyping—enabling norm-sensitive, fine-grained bias analysis beyond superficial neutrality measures. As a key contribution, we release the first open-source, GPT-4o–specific gender bias evaluation benchmark, featuring transparent annotation protocols, standardized prompts, and interpretable bias scores. This resource significantly enhances result explainability, cross-study comparability, and methodological transparency, establishing a standardized, reproducible infrastructure for LLM fairness research.
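The summary above describes a corpus annotated along dimensions such as occupation–gender association, from which interpretable bias scores are derived. The paper does not specify its scoring formula, so the following is only a hypothetical sketch of how one such score could be computed from annotation records: the skew of gendered model outputs for a given occupation, where 0.0 means balanced and 1.0 means fully one-sided. The `Annotation` schema and the function name are illustrative assumptions, not the authors' actual protocol.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Annotation:
    occupation: str  # occupation named in the prompt template
    gender: str      # gender assigned in the model output: "female", "male", or "neutral"


def occupation_bias_score(annotations, occupation):
    """Skew of gendered (non-neutral) outputs for one occupation.

    Returns |#female - #male| / (#female + #male): 0.0 = balanced, 1.0 = fully skewed.
    Hypothetical illustration only; not the paper's published metric.
    """
    counts = Counter(a.gender for a in annotations if a.occupation == occupation)
    female, male = counts["female"], counts["male"]
    gendered = female + male
    if gendered == 0:
        return 0.0  # no gendered outputs observed, treat as unbiased
    return abs(female - male) / gendered


sample = [
    Annotation("nurse", "female"),
    Annotation("nurse", "female"),
    Annotation("nurse", "male"),
    Annotation("nurse", "neutral"),  # neutral outputs are excluded from the skew
]
print(occupation_bias_score(sample, "nurse"))  # 1/3: two female vs. one male assignment
```

A real benchmark score would additionally aggregate over many prompts per occupation and report it alongside the pronoun-consistency and role-stereotyping dimensions mentioned above.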

📝 Abstract
The widespread application of Large Language Models (LLMs) involves ethical risks for users and societies. A prominent ethical risk of LLMs is the generation of unfair language output that reinforces or exacerbates harm for members of disadvantaged social groups through gender biases (Weidinger et al., 2022; Bender et al., 2021; Kotek et al., 2023). Hence, the evaluation of the fairness of LLM outputs with respect to such biases is a topic of rising interest. To advance research in this field, promote discourse on suitable normative bases and evaluation methodologies, and enhance the reproducibility of related studies, we propose a novel approach to database construction. This approach enables the assessment of gender-related biases in LLM-generated language beyond merely evaluating their degree of neutralization.
Problem

Research questions and friction points this paper is trying to address.

Evaluate gender biases in GPT-4o outputs
Develop database for ethical LLM assessment
Enhance reproducibility in bias evaluation studies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Database construction for gender bias evaluation
Assessing biases beyond neutralization degree
Enhancing reproducibility in LLM fairness studies
Luise Mehner
TU Berlin, Einsteinufer 17, 10587 Berlin
Lena Fiedler
TU Berlin, Einsteinufer 17, 10587 Berlin
Sabine Ammon
Technische Universität Berlin
Dorothea Kolossa
Technische Universität Berlin
Human-centered tech · Multimodal Signal Processing · Speech Recognition