🤖 AI Summary
This study investigates how reliably large language models (LLMs) can replicate real individuals' behavior in health opinion surveys and decision-making simulations. Leveraging empirical data from the Understanding America Study (UAS), we develop a demographically grounded digital-twin agent framework to systematically evaluate behavioral consistency and bias patterns in Llama 3, GPT-series models, and other LLMs along dimensions such as race and income. Our analysis reveals, for the first time, that LLMs not only reproduce observed societal biases but also introduce novel biases absent from the UAS data. We find that Llama 3 captures decision heterogeneity better, whereas several models significantly overestimate vaccine acceptance, posing critical risks for public health modeling. Crucially, we show that prompting strategies themselves constitute a source of bias. These findings offer methodological cautions and establish foundational evaluation benchmarks for LLM-driven social behavior simulation.
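To make the consistency check concrete, here is a minimal sketch of the kind of subgroup comparison such an evaluation implies: simulated vaccine-acceptance rates are compared with observed rates within each demographic subgroup. The record fields, subgroup key, and toy numbers are illustrative assumptions, not the paper's data or code.

```python
# Illustrative sketch of a subgroup consistency check: compare simulated
# vs. observed vaccine-acceptance rates per demographic group.
# Field names and toy values are hypothetical, for shape only.
from collections import defaultdict

def acceptance_rate_by_group(records, group_key):
    """records: list of dicts with a demographic key and a 0/1 'accept' flag."""
    totals, accepts = defaultdict(int), defaultdict(int)
    for r in records:
        g = r[group_key]
        totals[g] += 1
        accepts[g] += r["accept"]
    return {g: accepts[g] / totals[g] for g in totals}

def subgroup_gaps(real, simulated, group_key):
    """Signed gap (simulated - real) in acceptance rate per subgroup;
    a large positive gap means the model overestimates acceptance."""
    real_rates = acceptance_rate_by_group(real, group_key)
    sim_rates = acceptance_rate_by_group(simulated, group_key)
    return {g: sim_rates[g] - real_rates[g] for g in real_rates if g in sim_rates}

# Toy example with made-up numbers:
real = [{"race": "White", "accept": 1}, {"race": "White", "accept": 0},
        {"race": "Black", "accept": 0}, {"race": "Black", "accept": 1}]
sim  = [{"race": "White", "accept": 1}, {"race": "White", "accept": 1},
        {"race": "Black", "accept": 1}, {"race": "Black", "accept": 0}]
print(subgroup_gaps(real, sim, "race"))  # {'White': 0.5, 'Black': 0.0}
```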
📝 Abstract
Generative agents driven by large language models (LLMs) are increasingly used to simulate human behaviour in silico. These simulacra serve as sandboxes for studying human behaviour without compromising privacy or safety. However, it remains unclear whether such agents can faithfully represent real individuals. This work compares survey data from the Understanding America Study (UAS) on healthcare decision-making with simulated responses from generative agents. Using demographic-based prompt engineering, we create digital twins of survey respondents and analyse how well different LLMs reproduce real-world behaviours. Our findings show that some LLMs fail to reflect realistic decision-making, for example predicting universal vaccine acceptance. Llama 3, in contrast, captures variation across race and income more accurately, but it also introduces biases not present in the UAS data. This study highlights the potential of generative agents for behavioural research while underscoring the risks of bias from both LLMs and prompting strategies.
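As a concrete illustration of demographic-based prompt engineering, the sketch below assembles a digital-twin persona prompt from UAS-style respondent attributes. The field names, persona template, and survey question are assumptions made for illustration, not the authors' actual prompts.

```python
# Minimal sketch of demographic-based prompt engineering for a
# "digital twin" survey agent. The attributes, template, and question
# below are hypothetical, not taken from the paper.
from dataclasses import dataclass

@dataclass
class Respondent:
    age: int
    race: str
    income_bracket: str
    education: str
    gender: str

PERSONA_TEMPLATE = (
    "You are answering a health survey as the following person:\n"
    "- Age: {age}\n"
    "- Race/ethnicity: {race}\n"
    "- Household income: {income_bracket}\n"
    "- Education: {education}\n"
    "- Gender: {gender}\n"
    "Answer exactly as this person plausibly would, in one word."
)

QUESTION = "If a COVID-19 vaccine were available to you today, would you take it? (yes/no)"

def build_prompt(r: Respondent) -> str:
    """Turn one survey respondent's demographics into a persona prompt."""
    return PERSONA_TEMPLATE.format(**r.__dict__) + "\n\n" + QUESTION

if __name__ == "__main__":
    twin = Respondent(age=46, race="Black", income_bracket="$30k-$60k",
                      education="high school", gender="female")
    print(build_prompt(twin))
    # The prompt would then be sent to the LLM under test (Llama 3,
    # a GPT-series model, etc.) and the sampled answer compared with
    # the real respondent's recorded UAS answer.
```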