🤖 AI Summary
Existing evaluations of political bias in LLMs rely on unstable prompting strategies, which undermines cross-model comparability. To address this, we propose the Questionnaire Modeling (QM) task, which supplies real human survey responses as in-context examples during bias assessment, improving response consistency and comparability through in-context learning. The QM framework also enables systematic comparison of instruction-tuned models with their base versions, revealing how instruction tuning can change the direction of bias. Our analysis further shows that larger models leverage in-context examples more effectively and exhibit lower, more stable bias scores under QM. Experiments across LLMs of multiple sizes confirm that QM improves evaluation stability, establishing a reproducible and interpretable paradigm for assessing political bias in large language models.
📝 Abstract
A growing body of work has been querying LLMs with political questions to evaluate their potential biases. However, this probing method has limited stability, making comparisons between models unreliable. In this paper, we argue that LLMs need more context. We propose a new probing task, Questionnaire Modeling (QM), that uses human survey data as in-context examples. We show that QM improves the stability of question-based bias evaluation, and demonstrate that it may be used to compare instruction-tuned models to their base versions. Experiments with LLMs of various sizes indicate that instruction tuning can indeed change the direction of bias. Furthermore, we observe a trend that larger models are able to leverage in-context examples more effectively, and generally exhibit smaller bias scores in QM. Data and code are publicly available.
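The core of QM is conditioning the model on real survey responses before asking the probe question. A minimal sketch of such a prompt builder is below; the item wording, answer scale, and function name are invented for illustration and are not taken from the paper's actual questionnaire or data format.

```python
def build_qm_prompt(examples, target_question):
    """Format human survey responses as in-context examples,
    then append the target item for the model to complete."""
    lines = ["Below are questionnaire items answered by a survey respondent."]
    for question, answer in examples:
        lines.append(f"Statement: {question}\nAnswer: {answer}")
    # The model is expected to continue with an answer on the same scale.
    lines.append(f"Statement: {target_question}\nAnswer:")
    return "\n\n".join(lines)

# Hypothetical survey items, for illustration only.
examples = [
    ("Taxes on the wealthy should be increased.", "Agree"),
    ("Government regulation of business is usually harmful.", "Disagree"),
]
prompt = build_qm_prompt(examples, "Public spending on healthcare should rise.")
print(prompt)
```

The in-context examples anchor the model to a fixed response format and a concrete respondent profile, which is what makes scores comparable across models and across base/instruction-tuned pairs.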