🤖 AI Summary
Existing methods for measuring political bias in large language models (LLMs), notably the Political Compass Test (PCT), suffer from low validity and inconsistent results. Method: We propose the first theoretically grounded, psychometrically rigorous framework for assessing LLM political bias, informed by political science theory and validated survey design principles. It enables robust multi-prompt evaluation across 11 major open and commercial LLMs, automatically classifying 88,110 model responses to produce fine-grained political stance profiles. Our approach integrates classical scale-development criteria, semantic-similarity-based rule-enhanced classification, and cross-prompt stability analysis into the LLM evaluation pipeline. Contributions/Results: We find that the PCT substantially overestimates bias in models such as GPT-3.5; that instruction-tuned models exhibit an overall leftward tilt but high prompt sensitivity; and we publicly release the first theory-driven, prompt-robust, and fully reproducible benchmark dataset for evaluating LLM political bias.
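A minimal sketch of what the semantic-similarity-based, rule-enhanced classification step could look like, assuming a sentence-transformers embedding model; the rule markers, anchor phrases, label set, and threshold below are illustrative assumptions, not the paper's actual configuration:

```python
# Hypothetical sketch of rule-enhanced, semantic-similarity stance
# classification. All rules, anchors, and thresholds are assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Rule layer: explicit agreement markers short-circuit the classifier.
RULES = {"strongly agree": "agree", "strongly disagree": "disagree"}

# Semantic layer: anchor phrases representing each stance label.
ANCHORS = {
    "agree": "I agree with this statement.",
    "disagree": "I disagree with this statement.",
    "refusal": "As an AI, I cannot take a political position.",
}
anchor_emb = {label: model.encode(text, convert_to_tensor=True)
              for label, text in ANCHORS.items()}

def classify(response: str, threshold: float = 0.4) -> str:
    """Return a stance label for one model response."""
    lowered = response.lower()
    for marker, label in RULES.items():
        if marker in lowered:
            return label  # a rule fires before the embedding fallback
    emb = model.encode(response, convert_to_tensor=True)
    scores = {label: util.cos_sim(emb, a).item()
              for label, a in anchor_emb.items()}
    best = max(scores, key=scores.get)
    # Responses far from every anchor are left unclassified.
    return best if scores[best] >= threshold else "unclear"
```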
📝 Abstract
Prompt-based language models like GPT-4 and LLaMA have been used for a wide variety of tasks, such as simulating agents, searching for information, and analyzing content. For these and other applications, political biases in the models can affect performance. Several researchers have studied political bias in language models using survey-based evaluation suites, such as the Political Compass Test (PCT), often finding that the models favor a particular leaning. However, exact prompting techniques vary, leading to diverging findings, and most research relies on constrained-answer settings to extract model responses. Moreover, the Political Compass Test is not a scientifically valid survey instrument. In this work, we contribute a political bias measure informed by political science theory, building on survey design principles to test a wide variety of input prompts while accounting for prompt sensitivity. We then prompt 11 different open and commercial models, differentiating between instruction-tuned and non-instruction-tuned models, and automatically classify their political stances from 88,110 responses. Leveraging this dataset, we compute political bias profiles across prompt variations and find that while the PCT exaggerates bias in certain models such as GPT-3.5, measures of political bias are often unstable, though generally more left-leaning for instruction-tuned models.
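One way the per-model bias profiles and prompt-sensitivity check could be computed from the classified responses is sketched below; the file name, column names, and the +1/-1 scoring scheme are hypothetical, not taken from the paper:

```python
# Illustrative cross-prompt aggregation: for each model and statement,
# average stance scores over prompt paraphrases and measure their spread.
import pandas as pd

# Assumed layout: one classified response per row, with columns
# model, statement_id, prompt_variant, and stance in {agree, disagree, ...}.
df = pd.read_csv("responses.csv")
df = df.assign(score=df["stance"].map({"agree": 1, "disagree": -1}).fillna(0))

profile = (
    df.groupby(["model", "statement_id"])["score"]
      .agg(mean_stance="mean", prompt_std="std")  # spread across prompt variants
      .reset_index()
)
# A high prompt_std flags statements whose measured stance is prompt-sensitive.
print(profile.sort_values("prompt_std", ascending=False).head())
```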