🤖 AI Summary
This work addresses the challenge of systematically evaluating implicit and structural gender bias in the output of large language models (LLMs) such as GPT-4o, where conventional fairness metrics lack sensitivity and interpretability. Methodologically, we propose a multidimensional, reproducible fairness assessment paradigm that integrates human annotation, adversarial prompt engineering, and cross-contextual template generation. Our structured corpus is annotated along three core dimensions: occupation-gender associations, pronoun consistency, and sociocultural role stereotyping. This enables norm-sensitive, fine-grained bias analysis that goes beyond superficial measures of neutrality. As a key contribution, we release the first open-source gender bias evaluation benchmark specific to GPT-4o, featuring transparent annotation protocols, standardized prompts, and interpretable bias scores. This resource improves the explainability of results, cross-study comparability, and methodological transparency, establishing a standardized, reproducible infrastructure for LLM fairness research.
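To make the corpus description above concrete, the following is a minimal, hypothetical sketch of what one annotated evaluation item and simple per-dimension bias scores could look like. The field names, the ordinal label scheme, and the scoring rule are illustrative assumptions for this sketch, not the benchmark's actual schema or protocol.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical annotation scheme: each dimension is labelled by human annotators
# on a small ordinal scale (0 = no bias observed, 1 = implicit bias, 2 = explicit bias).
# The real benchmark's labels and protocol may differ; this is only an illustration.

@dataclass
class EvaluationItem:
    prompt: str              # standardized prompt given to the model (e.g. GPT-4o)
    model_output: str        # the model's generated continuation
    occupation_gender: int   # occupation-gender association label
    pronoun_consistency: int # pronoun consistency label
    role_stereotyping: int   # sociocultural role stereotyping label

def dimension_scores(items: list[EvaluationItem]) -> dict[str, float]:
    """Average each annotation dimension over the corpus, normalized to [0, 1]."""
    max_label = 2  # upper end of the assumed ordinal scale
    return {
        "occupation_gender": mean(i.occupation_gender for i in items) / max_label,
        "pronoun_consistency": mean(i.pronoun_consistency for i in items) / max_label,
        "role_stereotyping": mean(i.role_stereotyping for i in items) / max_label,
    }

# Example usage with two toy items (annotations invented purely for illustration).
corpus = [
    EvaluationItem(
        prompt="The nurse finished the shift and then ___",
        model_output="she went home to her family.",
        occupation_gender=2, pronoun_consistency=0, role_stereotyping=1,
    ),
    EvaluationItem(
        prompt="The engineer reviewed the design and then ___",
        model_output="they submitted it for approval.",
        occupation_gender=0, pronoun_consistency=0, role_stereotyping=0,
    ),
]

print(dimension_scores(corpus))
```

Reporting one normalized score per annotation dimension, rather than a single aggregate, keeps the result interpretable in the sense described above: a reader can see which of the three dimensions drives an observed bias.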
📝 Abstract
The widespread application of Large Language Models (LLMs) poses ethical risks for users and societies. A prominent ethical risk of LLMs is the generation of unfair language output that reinforces or exacerbates harm to members of disadvantaged social groups through gender bias (Weidinger et al., 2022; Bender et al., 2021; Kotek et al., 2023). The evaluation of the fairness of LLM outputs with respect to such biases is therefore a topic of growing interest. To advance research in this field, promote discourse on suitable normative bases and evaluation methodologies, and enhance the reproducibility of related studies, we propose a novel approach to database construction. This approach enables the assessment of gender-related biases in LLM-generated language beyond merely evaluating their degree of neutralization.