🤖 AI Summary
This study investigates whether large language models (LLMs) replicate human-like response biases, particularly normative value biases, in survey contexts, and examines their robustness to prompt perturbations. Method: We systematically design 11 prompt perturbations, including option reordering, semantic paraphrasing, and compound variations, and evaluate nine state-of-the-art LLMs on World Values Survey items across more than 167,000 simulated interviews. Contribution/Results: We identify a pervasive "recency bias" in LLMs, a systematic preference for the last-listed option, and show that semantic and compound perturbations elicit human-like response patterns. Although LLMs exhibit greater overall robustness than humans, all models remain highly sensitive to subtle prompt details. These findings establish a bias-aware foundation for using LLMs as survey agents, underscore the critical role of prompt engineering in ensuring the validity of synthetic social data, and provide both a methodological benchmark and a reliability caution for AI-driven social science research.
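To make the perturbation setup concrete, here is a minimal Python sketch of one perturbation type, option reordering, applied to a WVS-style item. All names (`SurveyItem`, `reverse_options`, `format_prompt`) and the example question are hypothetical illustrations under assumptions about the setup, not the authors' code.

```python
# Minimal sketch (not the authors' implementation): applying an
# option-order perturbation to a World Values Survey-style item.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SurveyItem:
    question: str
    options: tuple[str, ...]

def reverse_options(item: SurveyItem) -> SurveyItem:
    """One perturbation type from the paper's family: flip the option order."""
    return replace(item, options=tuple(reversed(item.options)))

def format_prompt(item: SurveyItem) -> str:
    """Render the item as a single prompt string for an LLM interview."""
    lines = [item.question]
    lines += [f"{i + 1}. {opt}" for i, opt in enumerate(item.options)]
    lines.append("Answer with the number of one option.")
    return "\n".join(lines)

item = SurveyItem(
    question="How important is family in your life?",
    options=("Very important", "Rather important",
             "Not very important", "Not at all important"),
)
print(format_prompt(item))                   # original ordering
print(format_prompt(reverse_options(item)))  # perturbed ordering
```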
📝 Abstract
Large Language Models (LLMs) are increasingly used as proxies for human subjects in social science surveys, but their reliability and susceptibility to known response biases are poorly understood. This paper investigates the response robustness of LLMs in normative survey contexts: we test nine diverse LLMs on questions from the World Values Survey (WVS), applying a comprehensive set of 11 perturbations to both question phrasing and answer option structure, resulting in over 167,000 simulated interviews. In doing so, we not only reveal LLMs' vulnerability to perturbations but also show that all tested models exhibit a consistent *recency bias* of varying intensity, disproportionately favoring the last-presented answer option. While larger models are generally more robust, all models remain sensitive to semantic variations, such as paraphrasing, and to combined perturbations. Our perturbation battery further shows that LLMs partially align with survey response biases identified in humans. This underscores the critical importance of prompt design and robustness testing when using LLMs to generate synthetic survey data.
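As a rough illustration of how a recency bias could be detected, the sketch below compares how often the last-listed position is chosen under the original and a reversed option ordering. The function name, data layout, and toy numbers are assumptions for illustration, not results or code from the paper.

```python
# Minimal sketch, assuming responses collected under both orderings:
# if a model favors whichever option is listed last, the last-position
# share stays elevated even after the option order is reversed.
from collections import Counter

def last_position_share(choices: list[int], n_options: int) -> float:
    """Fraction of responses that picked the last-listed option (0-indexed)."""
    counts = Counter(choices)
    return counts[n_options - 1] / len(choices)

# Toy data: 0-indexed option positions chosen across simulated interviews.
original_order = [3, 3, 2, 3, 1, 3, 3, 0, 3, 2]   # last option = index 3
reversed_order = [3, 2, 3, 3, 3, 1, 3, 3, 0, 3]   # options reversed first

share_orig = last_position_share(original_order, 4)
share_rev = last_position_share(reversed_order, 4)

# A last-position share well above 1/n under BOTH orderings points to a
# positional (recency) bias rather than a genuine preference for one answer.
print(f"original: {share_orig:.0%}, reversed: {share_rev:.0%}")
```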