🤖 AI Summary
This work investigates gender–nationality intersectional bias in large language models (LLMs) for occupational recommendation within multilingual settings. To address the lack of standardized evaluation, we construct a multilingual benchmark—covering 25 countries and four pronoun categories across English, Spanish, and German—and introduce the first framework for quantifying intersectional bias in multilingual LLMs. Using zero-shot inference on five Llama-family models, we employ systematic prompt engineering and multidimensional metrics—including occupational distribution entropy and bias score differential—to measure bias magnitude and stability. Results reveal pervasive intersectional bias across all models; instruction-tuned variants exhibit the lowest and most consistent bias levels; and switching the language of prompts significantly modulates bias expression. Critically, unidimensional fairness interventions prove insufficient to mitigate joint biases, underscoring the necessity and urgency of multilingual intersectional fairness assessment.
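The summary mentions two metrics, occupational distribution entropy and bias score differential. The paper's exact definitions are not reproduced here; the sketch below assumes a plain Shannon entropy over the occupations a model recommends for each (country, pronoun) group, with the differential taken against the mean entropy across groups. The group data is purely illustrative.

```python
import math
from collections import Counter

def occupation_entropy(recommendations):
    """Shannon entropy (bits) of an occupation distribution.

    Lower entropy means recommendations concentrate on a few
    occupations, a possible sign of stereotyping for that group.
    """
    counts = Counter(recommendations)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def bias_score_differential(entropy_by_group, baseline):
    """Gap between each group's entropy and a baseline (here, the mean).

    A strongly negative value flags a group whose recommendations are
    much more concentrated than average.
    """
    return {g: h - baseline for g, h in entropy_by_group.items()}

# Hypothetical model outputs for two (country, pronoun) groups.
groups = {
    ("Mexico", "she"): ["nurse", "nurse", "teacher", "nurse"],
    ("Germany", "he"): ["engineer", "doctor", "lawyer", "manager"],
}
entropies = {g: occupation_entropy(r) for g, r in groups.items()}
baseline = sum(entropies.values()) / len(entropies)
diffs = bias_score_differential(entropies, baseline)
```

Under this toy data, the uniform four-occupation group reaches the maximum entropy of 2 bits, while the nurse-heavy group falls well below it, so its differential is negative.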
📝 Abstract
One of the goals of fairness research in NLP is to measure and mitigate stereotypical biases that are propagated by NLP systems. However, such work tends to focus on single axes of bias (most often gender) and the English language. Addressing these limitations, we contribute the first study of multilingual intersectional country and gender biases, with a focus on occupation recommendations generated by large language models. We construct a benchmark of prompts in English, Spanish, and German, where we systematically vary country and gender, using 25 countries and four pronoun sets. We then evaluate a suite of five Llama-based models on this benchmark, finding that LLMs encode significant gender and country biases. Notably, we find that even when models show parity for gender or country individually, intersectional occupational biases based on both country and gender persist. We also show that the prompting language significantly affects bias, and that instruction-tuned models consistently demonstrate the lowest and most stable levels of bias. Our findings highlight the need for fairness researchers to use intersectional and multilingual lenses in their work.
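The benchmark construction described above can be sketched as a simple cross product of countries and pronoun sets per language. The templates, country list, and pronoun sets below are illustrative stand-ins; the paper's actual wording, 25 countries, and four pronoun sets are not reproduced here.

```python
from itertools import product

# Hypothetical per-language templates (not the paper's exact prompts).
TEMPLATES = {
    "en": "{pronoun} is from {country}. What occupation would you recommend?",
    "es": "{pronoun} es de {country}. ¿Qué ocupación recomendarías?",
    "de": "{pronoun} kommt aus {country}. Welchen Beruf würdest du empfehlen?",
}
COUNTRIES = ["Mexico", "Germany", "Japan"]      # the paper uses 25 countries
PRONOUNS = ["He", "She", "They", "Xe"]          # the paper uses four pronoun sets

def build_prompts(lang, countries, pronouns):
    """Cross every country with every pronoun for one prompting language."""
    template = TEMPLATES[lang]
    return [template.format(pronoun=p, country=c)
            for c, p in product(countries, pronouns)]

prompts = build_prompts("en", COUNTRIES, PRONOUNS)
# 3 countries x 4 pronoun forms -> 12 English prompts in this toy setup
```

Running the same grid per language is what lets the study compare bias across English, Spanish, and German while holding the country and pronoun variation fixed.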