🤖 AI Summary
This study evaluates whether large language model (LLM)-driven digital twins can accurately reproduce the behavioral responses of real individuals and populations. Method: Across 19 preregistered studies, the authors benchmarked LLM-based digital twins against nationally representative U.S. survey data spanning 164 behavioral outcomes, assessing consistency in individual-level predictions, population-level means, and response variability. Results: Digital twins capture inter-individual rank-order differences only modestly (mean r ≈ 0.2), show limited accuracy in predicting individuals' exact responses or aggregate means, and systematically underestimate response variability. Performance varies with participants' education, income, and ideological moderation, and differs substantially across domains. Critically, this work provides the first large-scale, empirically grounded behavioral validity benchmark for digital twins, integrating longitudinal individual histories, multi-domain surveys, and strict preregistration, thereby delineating the current practical limits of the approach and establishing a validation framework and actionable pathways for developing trustworthy digital twins.
📝 Abstract
Do "digital twins" capture individual responses in surveys and experiments? We run 19 pre-registered studies on a national U.S. panel and their LLM-powered digital twins (constructed from extensive, previously collected individual-level data) and compare twin and human answers across 164 outcomes. The correlation between twin and human answers is modest (approximately 0.2 on average), and twin responses are less variable than human responses. While constructing digital twins from rich individual-level data improves our ability to capture heterogeneity across participants and to predict relative differences between them, it does not substantially improve predictions of specific participants' exact answers or of population means. Twin performance varies by domain and is higher among more educated, higher-income, and ideologically moderate participants. These results suggest that current digital twins can capture some degree of relative differences but are unreliable for individual-level prediction and for estimating sample means and variances, underscoring the need for careful validation before use. Our data and code are publicly available for researchers and practitioners interested in optimizing digital twin pipelines.
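The two headline diagnostics above, twin–human correlation per outcome and the twins' reduced response variability, can be sketched in a few lines. This is a hypothetical illustration on invented Likert-scale data, not the paper's actual pipeline; the function names and toy numbers are assumptions for exposition only.

```python
# Hypothetical sketch of twin-vs-human comparison metrics (not the study's code):
# per-outcome Pearson correlation, plus a twin-to-human variance ratio,
# where a ratio below 1 indicates twins under-disperse relative to humans.
from statistics import mean, pvariance

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def variance_ratio(twin, human):
    """Twin-to-human population variance ratio; < 1 means less variable twins."""
    return pvariance(twin) / pvariance(human)

# Invented 1-7 Likert answers for one outcome across ten panelists.
human = [1, 7, 4, 2, 6, 3, 5, 7, 1, 4]
twin  = [3, 5, 4, 3, 5, 4, 4, 5, 3, 4]  # compressed toward the midpoint

r = pearson_r(twin, human)
vr = variance_ratio(twin, human)
```

In this toy case the twins track the rank ordering of people (high r) while their answers are far less spread out (vr well below 1), mirroring the pattern the abstract reports at scale.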