🤖 AI Summary
This work addresses the challenge of systematically evaluating implicit and structural gender bias in the output of large language models (LLMs) such as GPT-4o, where conventional fairness metrics lack sensitivity and interpretability. Methodologically, we propose a multidimensional, reproducible fairness assessment paradigm that integrates human annotation, adversarial prompt engineering, and cross-contextual template generation. Our structured corpus is annotated along three core dimensions: occupation-gender associations, pronoun consistency, and sociocultural role stereotyping. This enables norm-sensitive, fine-grained bias analysis that goes beyond superficial measures of neutrality. As a key contribution, we release the first open-source gender bias evaluation benchmark specific to GPT-4o, featuring transparent annotation protocols, standardized prompts, and interpretable bias scores. This resource improves the explainability of results, cross-study comparability, and methodological transparency, establishing a standardized, reproducible infrastructure for LLM fairness research.
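To make the corpus description above concrete, the following is a minimal, hypothetical sketch of what one annotated evaluation item and simple per-dimension bias scores could look like. The field names, the ordinal label scheme, and the scoring rule are illustrative assumptions for this sketch, not the benchmark's actual schema or protocol.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical annotation scheme: each dimension is labelled by human annotators
# on a small ordinal scale (0 = no bias observed, 1 = implicit bias, 2 = explicit bias).
# The real benchmark's labels and protocol may differ; this is only an illustration.

@dataclass
class EvaluationItem:
    prompt: str              # standardized prompt given to the model (e.g. GPT-4o)
    model_output: str        # the model's generated continuation
    occupation_gender: int   # occupation-gender association label
    pronoun_consistency: int # pronoun consistency label
    role_stereotyping: int   # sociocultural role stereotyping label

def dimension_scores(items: list[EvaluationItem]) -> dict[str, float]:
    """Average each annotation dimension over the corpus, normalized to [0, 1]."""
    max_label = 2  # upper end of the assumed ordinal scale
    return {
        "occupation_gender": mean(i.occupation_gender for i in items) / max_label,
        "pronoun_consistency": mean(i.pronoun_consistency for i in items) / max_label,
        "role_stereotyping": mean(i.role_stereotyping for i in items) / max_label,
    }

# Example usage with two toy items (annotations invented purely for illustration).
corpus = [
    EvaluationItem(
        prompt="The nurse finished the shift and then ___",
        model_output="she went home to her family.",
        occupation_gender=2, pronoun_consistency=0, role_stereotyping=1,
    ),
    EvaluationItem(
        prompt="The engineer reviewed the design and then ___",
        model_output="they submitted it for approval.",
        occupation_gender=0, pronoun_consistency=0, role_stereotyping=0,
    ),
]

print(dimension_scores(corpus))
```

Reporting one normalized score per annotation dimension, rather than a single aggregate, keeps the result interpretable in the sense described above: a reader can see which of the three dimensions drives an observed bias.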
📝 Abstract
The widespread application of Large Language Models (LLMs) poses ethical risks for users and societies. A prominent ethical risk of LLMs is the generation of unfair language output that reinforces or exacerbates harm to members of disadvantaged social groups through gender bias (Weidinger et al., 2022; Bender et al., 2021; Kotek et al., 2023). The evaluation of the fairness of LLM outputs with respect to such biases is therefore a topic of growing interest. To advance research in this field, promote discourse on suitable normative bases and evaluation methodologies, and enhance the reproducibility of related studies, we propose a novel approach to database construction. This approach enables the assessment of gender-related biases in LLM-generated language beyond merely evaluating their degree of neutralization.