GG-BBQ: German Gender Bias Benchmark for Question Answering

📅 2025-07-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses gender bias in German large language models (LLMs) on question-answering tasks, a critical yet underexplored issue due to German’s grammatical gender system. Method: We construct the first high-quality, German-specific bias evaluation benchmark, designed explicitly for grammatical gender phenomena. The dataset comprises two subsets—group terms and proper names—generated by translating English templates and rigorously refined by native German linguists to avoid machine-translation artifacts. We employ a question-answering evaluation framework to quantify both accuracy and gender bias across multiple German LLMs. Contribution/Results: All evaluated models exhibit significant gender bias—some amplifying, others contradicting societal stereotypes—revealing systemic fairness deficiencies in current German LLMs. This work establishes the first standardized, grammar-aware methodology for bias evaluation in German, providing a reproducible paradigm and publicly available benchmark for fairness research in non-English LLMs.

Technology Category

Application Category

📝 Abstract
Within the context of Natural Language Processing (NLP), fairness evaluation is often associated with the assessment of bias and reduction of associated harm. In this regard, the evaluation is usually carried out by using a benchmark dataset, for a task such as Question Answering, created for the measurement of bias in the model's predictions along various dimensions, including gender identity. In our work, we evaluate gender bias in German Large Language Models (LLMs) using the Bias Benchmark for Question Answering by Parrish et al. (2022) as a reference. Specifically, the templates in the gender identity subset of this English dataset were machine translated into German. The errors in the machine translated templates were then manually reviewed and corrected with the help of a language expert. We find that manual revision of the translation is crucial when creating datasets for gender bias evaluation because of the limitations of machine translation from English to a language such as German with grammatical gender. Our final dataset is comprised of two subsets: Subset-I, which consists of group terms related to gender identity, and Subset-II, where group terms are replaced with proper names. We evaluate several LLMs used for German NLP on this newly created dataset and report the accuracy and bias scores. The results show that all models exhibit bias, both along and against existing social stereotypes.
Problem

Research questions and friction points this paper is trying to address.

Evaluating gender bias in German Large Language Models
Creating a German gender bias benchmark for Question Answering
Assessing bias and accuracy in German NLP models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Machine translated English gender bias templates
Manually corrected German translations with experts
Evaluated German LLMs on new bias dataset
🔎 Similar Papers
No similar papers found.