Benchmarking Gender and Political Bias in Large Language Models

📅 2025-09-07
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work systematically evaluates gender and political-orientation biases in large language models (LLMs) in politically sensitive contexts. To this end, we introduce EuroParlVote, the first benchmark dataset integrating European Parliament speech transcripts, roll-call voting records, and demographic attributes (e.g., gender, age, country, and political group). We propose a dual-task evaluation framework covering MEP gender classification and vote prediction while quantifying robustness and group fairness. Experimental results reveal that mainstream LLMs consistently misclassify female MEPs and show significantly lower prediction accuracy for ideologically extreme political groups. GPT-4o outperforms open-weight models in accuracy, robustness, and subgroup fairness. The study establishes a reproducible benchmark and methodology for assessing LLM bias in political discourse.
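The group-fairness aspect of the evaluation can be made concrete with a small sketch. Assuming each evaluated example carries a gold label, a model prediction, and a demographic attribute (the field names below are illustrative, not taken from the released code), per-group accuracy and the largest accuracy gap across subgroups might be computed like this:

```python
from collections import defaultdict

def per_group_accuracy(examples, group_key):
    """Accuracy for each demographic subgroup.

    `examples` is an iterable of dicts with hypothetical keys
    'gold', 'pred', and a grouping attribute such as 'gender'
    or 'political_group'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        g = ex[group_key]
        total[g] += 1
        correct[g] += int(ex["pred"] == ex["gold"])
    return {g: correct[g] / total[g] for g in total}

def accuracy_gap(group_acc):
    """Largest accuracy difference across subgroups,
    a simple indicator of unequal performance."""
    values = list(group_acc.values())
    return max(values) - min(values)

# Toy example (illustrative predictions only).
examples = [
    {"gold": "for", "pred": "for", "gender": "F", "political_group": "S&D"},
    {"gold": "against", "pred": "for", "gender": "F", "political_group": "ID"},
    {"gold": "for", "pred": "for", "gender": "M", "political_group": "EPP"},
]
by_gender = per_group_accuracy(examples, "gender")
print(by_gender, accuracy_gap(by_gender))
```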

๐Ÿ“ Abstract
We introduce EuroParlVote, a novel benchmark for evaluating large language models (LLMs) in politically sensitive contexts. It links European Parliament debate speeches to roll-call vote outcomes and includes rich demographic metadata for each Member of the European Parliament (MEP), such as gender, age, country, and political group. Using EuroParlVote, we evaluate state-of-the-art LLMs on two tasks -- gender classification and vote prediction -- revealing consistent patterns of bias. We find that LLMs frequently misclassify female MEPs as male and demonstrate reduced accuracy when simulating votes for female speakers. Politically, LLMs tend to favor centrist groups while underperforming on both far-left and far-right ones. Proprietary models like GPT-4o outperform open-weight alternatives in terms of both robustness and fairness. We release the EuroParlVote dataset, code, and demo to support future research on fairness and accountability in NLP within political contexts.
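As a rough illustration of how the two tasks could be run against such a dataset, the sketch below builds prompts for gender classification and vote prediction from a single speech record. The record fields and the `query_llm` helper are assumptions for illustration only, not the released EuroParlVote API.

```python
# Minimal sketch of the two evaluation tasks, assuming each record
# links a debate speech to the MEP's metadata and roll-call vote.
record = {
    "speech": "Madam President, I rise to support the proposed directive...",
    "gender": "F",              # gold label for gender classification
    "vote": "for",              # gold label for vote prediction
    "country": "ES",
    "political_group": "S&D",
}

def gender_prompt(speech: str) -> str:
    return (
        "Based only on the following European Parliament speech, "
        "is the speaker male or female?\n\n" + speech
    )

def vote_prompt(speech: str) -> str:
    return (
        "The following speech was given in a European Parliament debate. "
        "Did the speaker vote 'for' or 'against' the motion?\n\n" + speech
    )

def query_llm(prompt: str) -> str:
    """Placeholder for a call to GPT-4o or an open-weight model."""
    raise NotImplementedError

# gender_pred = query_llm(gender_prompt(record["speech"]))
# vote_pred = query_llm(vote_prompt(record["speech"]))
# Accuracy is then aggregated per gender and per political group.
```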
Problem

Research questions and friction points this paper is trying to address.

Evaluating gender and political bias in LLMs
Assessing vote prediction and gender classification accuracy
Analyzing performance disparities across the political spectrum
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel benchmark linking speeches to votes
Evaluates LLMs on gender and vote prediction
Releases dataset with demographic metadata for research
🔎 Similar Papers
No similar papers found.