Evaluating the Bias in LLMs for Surveying Opinion and Decision Making in Healthcare

📅 2025-04-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the reliability of large language models (LLMs) in replicating real individual behaviors for health opinion surveys and decision-making simulations. Leveraging empirical data from the Understanding America Study (UAS), we develop a demographically grounded digital twin agent framework to systematically evaluate behavioral consistency and bias patterns across Llama 3, GPT-series, and other LLMs along dimensions such as race and income. Our analysis reveals, for the first time, that LLMs not only reproduce observed societal biases but also introduce novel biases absent in UAS data. We find Llama 3 better captures decision heterogeneity, whereas several models significantly overestimate vaccine acceptance—posing critical risks for public health modeling. Crucially, we demonstrate that prompting strategies themselves constitute a source of bias. These findings provide methodological cautions and establish foundational evaluation benchmarks for LLM-driven social behavior simulation.

📝 Abstract
Generative agents have been increasingly used to simulate human behaviour in silico, driven by large language models (LLMs). These simulacra serve as sandboxes for studying human behaviour without compromising privacy or safety. However, it remains unclear whether such agents can truly represent real individuals. This work compares survey data from the Understanding America Study (UAS) on healthcare decision-making with simulated responses from generative agents. Using demographic-based prompt engineering, we create digital twins of survey respondents and analyse how well different LLMs reproduce real-world behaviours. Our findings show that some LLMs fail to reflect realistic decision-making, such as predicting universal vaccine acceptance. In contrast, Llama 3 captures variations across race and income more accurately, but it also introduces biases not present in the UAS data. This study highlights the potential of generative agents for behavioural research while underscoring the risks of bias from both LLMs and prompting strategies.
Problem

Research questions and friction points this paper is trying to address.

Assessing bias in LLMs for healthcare opinion surveys
Comparing real vs simulated healthcare decision-making data
Evaluating demographic accuracy and bias in LLM responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using demographic-based prompt engineering
Creating digital twins of respondents
Comparing real and simulated survey responses
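The demographic-based prompt engineering above can be sketched as a small persona-construction step: each UAS respondent's attributes are rendered into a system-style prompt that conditions the LLM to answer as that individual. The field names and prompt wording below are illustrative assumptions, not the paper's exact template.

```python
# Sketch of demographic-based prompt engineering for a "digital twin" agent.
# The respondent attributes and phrasing are hypothetical examples.

def build_twin_prompt(respondent: dict, question: str) -> str:
    """Compose a persona-conditioned survey prompt for an LLM agent."""
    persona = (
        f"You are a {respondent['age']}-year-old {respondent['race']} "
        f"{respondent['gender']} living in the United States with an annual "
        f"household income of {respondent['income']}."
    )
    instruction = (
        "Answer the survey question below as this person would, "
        "responding with a single option and no explanation."
    )
    return f"{persona}\n{instruction}\nQuestion: {question}"


respondent = {
    "age": 42,
    "race": "Black",
    "gender": "woman",
    "income": "$35,000-$49,999",
}
prompt = build_twin_prompt(
    respondent, "Would you accept a COVID-19 vaccine? (Yes/No)"
)
```

The resulting prompt would then be sent to each LLM under comparison (e.g. Llama 3 or a GPT-series model), and the distribution of simulated answers compared against the real UAS responses for the matching demographic group.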
Yonchanok Khaokaew — KMUTNB, UNSW
Flora D. Salim — University of New South Wales, Australia
Andreas Zufle — Emory University, USA
Hao Xue — University of New South Wales (human mobility, spatio-temporal data mining)
Taylor Anderson — Assistant Professor
M. Scotch — Arizona State University, USA
David J. Heslop — University of New South Wales, Australia