EuroGEST: Investigating gender stereotypes in multilingual language models

📅 2025-06-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the limitation that gender-bias evaluation for large language models (LLMs) remains largely confined to English, this work introduces EuroGEST, a benchmark for measuring gender-stereotypical reasoning across English and 29 other European languages. Building on an existing expert-informed benchmark of 16 gender stereotypes, EuroGEST extends the data cross-lingually using translation tools, quality-estimation metrics, and morphological heuristics, with human evaluation confirming the accuracy of both translations and gender labels. The authors evaluate 24 multilingual LLMs from six model families. Key findings: (1) the strongest stereotypes across all models and languages associate women with being beautiful, empathetic, and neat, and men with being leaders, strong/tough, and professional; (2) larger models encode gendered stereotypes more strongly; and (3) instruction finetuning does not consistently reduce them. The work highlights the need for more multilingual fairness studies and offers scalable methods and resources for auditing gender bias across languages.

📝 Abstract
Large language models increasingly support multiple languages, yet most benchmarks for gender bias remain English-centric. We introduce EuroGEST, a dataset designed to measure gender-stereotypical reasoning in LLMs across English and 29 European languages. EuroGEST builds on an existing expert-informed benchmark covering 16 gender stereotypes, expanded in this work using translation tools, quality estimation metrics, and morphological heuristics. Human evaluations confirm that our data generation method results in high accuracy of both translations and gender labels across languages. We use EuroGEST to evaluate 24 multilingual language models from six model families, demonstrating that the strongest stereotypes in all models across all languages are that women are *beautiful*, *empathetic* and *neat* and men are *leaders*, *strong, tough* and *professional*. We also show that larger models encode gendered stereotypes more strongly and that instruction finetuning does not consistently reduce gendered stereotypes. Our work highlights the need for more multilingual studies of fairness in LLMs and offers scalable methods and resources to audit gender bias across languages.
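The translate-then-filter step described in the abstract (machine-translate the stereotype sentences, then keep only translations that clear a quality-estimation threshold) can be sketched roughly as follows. The function names, the threshold value, and the toy length-ratio QE metric are illustrative assumptions only; the paper's actual QE metric and pipeline are not specified here.

```python
def filter_translations(pairs, qe_score, threshold=0.7):
    """Keep only (source, translation) pairs whose quality-estimation
    score clears the threshold; low-quality translations are dropped.
    `qe_score` is any callable returning a float in [0, 1]."""
    return [(src, tgt) for src, tgt in pairs if qe_score(src, tgt) >= threshold]

def length_ratio_qe(src, tgt):
    # Toy stand-in for a real QE metric: the ratio of the shorter
    # string's length to the longer's. A production pipeline would use
    # a learned quality-estimation model instead.
    shorter, longer = sorted((len(src), len(tgt)))
    return shorter / longer if longer else 1.0

pairs = [
    ("Women are empathetic.", "Les femmes sont empathiques."),
    ("Men are leaders.", "X"),  # degenerate translation, should be dropped
]
kept = filter_translations(pairs, length_ratio_qe)
```

A real deployment would swap `length_ratio_qe` for a reference-free QE model and tune the threshold per language pair, since acceptable length ratios vary across languages.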
Problem

Research questions and friction points this paper is trying to address.

Measuring gender stereotypes in multilingual language models
Expanding bias evaluation beyond English to 29 European languages
Assessing impact of model size and finetuning on stereotype strength
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual dataset for gender bias evaluation
Translation tools and morphological heuristics
Human-validated high accuracy translations
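As a rough illustration of what a morphological heuristic for gender labeling might look like, the sketch below guesses the grammatical gender of a Spanish adjective from its inflectional ending. This simple rule set is a hypothetical example, not the paper's method; real morphology-informed labeling would need per-language rules and exception handling.

```python
def spanish_adjective_gender(word):
    """Very rough heuristic: many Spanish adjectives mark gender with
    -o (masculine) vs. -a (feminine); invariant forms such as "fuerte"
    return None. Real Spanish morphology has many exceptions."""
    w = word.lower()
    if w.endswith("o"):
        return "masculine"
    if w.endswith("a"):
        return "feminine"
    return None  # invariant or unknown ending
```

For example, `spanish_adjective_gender("bonito")` yields `"masculine"` and `spanish_adjective_gender("bonita")` yields `"feminine"`, while the invariant `"fuerte"` yields `None`.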