🤖 AI Summary
This work investigates gender–nationality intersectional bias in large language models (LLMs) for occupational recommendation within multilingual settings. To address the lack of standardized evaluation, we construct a multilingual benchmark—covering 25 countries and four pronoun categories across English, Spanish, and German—and introduce the first framework for quantifying intersectional bias in multilingual LLMs. Using zero-shot inference on five Llama-family models, we employ systematic prompt engineering and multidimensional metrics—including occupational distribution entropy and bias score differential—to measure bias magnitude and stability. Results reveal pervasive intersectional bias across all models; instruction-tuned variants exhibit the lowest and most consistent bias levels; and switching the language of prompts significantly modulates bias expression. Critically, unidimensional fairness interventions prove insufficient to mitigate joint biases, underscoring the necessity and urgency of multilingual intersectional fairness assessment.
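The summary mentions two metrics, occupational distribution entropy and bias score differential. The paper's exact definitions are not reproduced here; the sketch below assumes a plain Shannon entropy over the occupations a model recommends for each (country, pronoun) group, with the differential taken against the mean entropy across groups. The group data is purely illustrative.

```python
import math
from collections import Counter

def occupation_entropy(recommendations):
    """Shannon entropy (bits) of an occupation distribution.

    Lower entropy means recommendations concentrate on a few
    occupations, a possible sign of stereotyping for that group.
    """
    counts = Counter(recommendations)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def bias_score_differential(entropy_by_group, baseline):
    """Gap between each group's entropy and a baseline (here, the mean).

    A strongly negative value flags a group whose recommendations are
    much more concentrated than average.
    """
    return {g: h - baseline for g, h in entropy_by_group.items()}

# Hypothetical model outputs for two (country, pronoun) groups.
groups = {
    ("Mexico", "she"): ["nurse", "nurse", "teacher", "nurse"],
    ("Germany", "he"): ["engineer", "doctor", "lawyer", "manager"],
}
entropies = {g: occupation_entropy(r) for g, r in groups.items()}
baseline = sum(entropies.values()) / len(entropies)
diffs = bias_score_differential(entropies, baseline)
```

Under this toy data, the uniform four-occupation group reaches the maximum entropy of 2 bits, while the nurse-heavy group falls well below it, so its differential is negative.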
📝 Abstract
One of the goals of fairness research in NLP is to measure and mitigate stereotypical biases that are propagated by NLP systems. However, such work tends to focus on single axes of bias (most often gender) and the English language. Addressing these limitations, we contribute the first study of multilingual intersectional country and gender biases, with a focus on occupation recommendations generated by large language models. We construct a benchmark of prompts in English, Spanish, and German, where we systematically vary country and gender, using 25 countries and four pronoun sets. We then evaluate a suite of five Llama-based models on this benchmark, finding that LLMs encode significant gender and country biases. Notably, we find that even when models show parity for gender or country individually, intersectional occupational biases based on both country and gender persist. We also show that the prompting language significantly affects bias, and that instruction-tuned models consistently demonstrate the lowest and most stable levels of bias. Our findings highlight the need for fairness researchers to use intersectional and multilingual lenses in their work.
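The benchmark construction described above can be sketched as a simple cross product of countries and pronoun sets per language. The templates, country list, and pronoun sets below are illustrative stand-ins; the paper's actual wording, 25 countries, and four pronoun sets are not reproduced here.

```python
from itertools import product

# Hypothetical per-language templates (not the paper's exact prompts).
TEMPLATES = {
    "en": "{pronoun} is from {country}. What occupation would you recommend?",
    "es": "{pronoun} es de {country}. ¿Qué ocupación recomendarías?",
    "de": "{pronoun} kommt aus {country}. Welchen Beruf würdest du empfehlen?",
}
COUNTRIES = ["Mexico", "Germany", "Japan"]      # the paper uses 25 countries
PRONOUNS = ["He", "She", "They", "Xe"]          # the paper uses four pronoun sets

def build_prompts(lang, countries, pronouns):
    """Cross every country with every pronoun for one prompting language."""
    template = TEMPLATES[lang]
    return [template.format(pronoun=p, country=c)
            for c, p in product(countries, pronouns)]

prompts = build_prompts("en", COUNTRIES, PRONOUNS)
# 3 countries x 4 pronoun forms -> 12 English prompts in this toy setup
```

Running the same grid per language is what lets the study compare bias across English, Spanish, and German while holding the country and pronoun variation fixed.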