Evaluating the Bias in LLMs for Surveying Opinion and Decision Making in Healthcare

📅 2025-04-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the reliability of large language models (LLMs) in replicating real individual behaviors for health opinion surveys and decision-making simulations. Leveraging empirical data from the Understanding America Study (UAS), we develop a demographically grounded digital twin agent framework to systematically evaluate behavioral consistency and bias patterns across Llama 3, GPT-series, and other LLMs along dimensions such as race and income. Our analysis reveals, for the first time, that LLMs not only reproduce observed societal biases but also introduce novel biases absent in UAS data. We find Llama 3 better captures decision heterogeneity, whereas several models significantly overestimate vaccine acceptance—posing critical risks for public health modeling. Crucially, we demonstrate that prompting strategies themselves constitute a source of bias. These findings provide methodological cautions and establish foundational evaluation benchmarks for LLM-driven social behavior simulation.

📝 Abstract
Generative agents have been increasingly used to simulate human behaviour in silico, driven by large language models (LLMs). These simulacra serve as sandboxes for studying human behaviour without compromising privacy or safety. However, it remains unclear whether such agents can truly represent real individuals. This work compares survey data from the Understanding America Study (UAS) on healthcare decision-making with simulated responses from generative agents. Using demographic-based prompt engineering, we create digital twins of survey respondents and analyse how well different LLMs reproduce real-world behaviours. Our findings show that some LLMs fail to reflect realistic decision-making, such as predicting universal vaccine acceptance. In contrast, Llama 3 captures variations across race and income more accurately, but it also introduces biases not present in the UAS data. This study highlights the potential of generative agents for behavioural research while underscoring the risks of bias from both LLMs and prompting strategies.
Problem

Research questions and friction points this paper is trying to address.

Assessing bias in LLMs for healthcare opinion surveys
Comparing real vs simulated healthcare decision-making data
Evaluating demographic accuracy and bias in LLM responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using demographic-based prompt engineering
Creating digital twins of respondents
Comparing real and simulated survey responses
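The demographic-based prompt engineering above can be sketched as a small persona-construction step: each UAS respondent's attributes are rendered into a system-style prompt that conditions the LLM to answer as that individual. The field names and prompt wording below are illustrative assumptions, not the paper's exact template.

```python
# Sketch of demographic-based prompt engineering for a "digital twin" agent.
# The respondent attributes and phrasing are hypothetical examples.

def build_twin_prompt(respondent: dict, question: str) -> str:
    """Compose a persona-conditioned survey prompt for an LLM agent."""
    persona = (
        f"You are a {respondent['age']}-year-old {respondent['race']} "
        f"{respondent['gender']} living in the United States with an annual "
        f"household income of {respondent['income']}."
    )
    instruction = (
        "Answer the survey question below as this person would, "
        "responding with a single option and no explanation."
    )
    return f"{persona}\n{instruction}\nQuestion: {question}"


respondent = {
    "age": 42,
    "race": "Black",
    "gender": "woman",
    "income": "$35,000-$49,999",
}
prompt = build_twin_prompt(
    respondent, "Would you accept a COVID-19 vaccine? (Yes/No)"
)
```

The resulting prompt would then be sent to each LLM under comparison (e.g. Llama 3 or a GPT-series model), and the distribution of simulated answers compared against the real UAS responses for the matching demographic group.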
Yonchanok Khaokaew — KMUTNB, UNSW
Flora D. Salim — University of New South Wales, Australia
Andreas Zufle — Emory University, USA
Hao Xue — University of New South Wales (human mobility, spatio-temporal data mining)
Taylor Anderson — Assistant Professor
M. Scotch — Arizona State University, USA
David J. Heslop — University of New South Wales, Australia