🤖 AI Summary
This study evaluates whether large language model (LLM)-driven digital twins can accurately reproduce the behavioral responses of real individuals and populations. Method: Across 19 preregistered studies, the authors benchmarked LLM-based digital twins against nationally representative U.S. survey data spanning 164 behavioral outcomes, assessing consistency in individual-level predictions, population-level means, and response variability. Results: Digital twins capture inter-individual rank-order differences only modestly (mean r ≈ 0.2), show limited accuracy in predicting individuals' exact responses or aggregate means, and systematically underestimate response variability. Performance varies with participants' education, income, and ideological moderation, and differs substantially across domains. Critically, this work provides the first large-scale, empirically grounded behavioral validity benchmark for digital twins, integrating longitudinal individual histories, multi-domain surveys, and strict preregistration, thereby delineating the current practical limits of the approach and establishing a validation framework and actionable pathways for developing trustworthy digital twins.
📝 Abstract
Do "digital twins" capture individual responses in surveys and experiments? We run 19 pre-registered studies on a national U.S. panel and their LLM-powered digital twins (constructed from extensive, previously collected individual-level data) and compare twin and human answers across 164 outcomes. The correlation between twin and human answers is modest (approximately 0.2 on average), and twin responses are less variable than human responses. While constructing digital twins from rich individual-level data improves our ability to capture heterogeneity across participants and to predict relative differences between them, it does not substantially improve predictions of specific participants' exact answers or of population means. Twin performance varies by domain and is higher among more educated, higher-income, and ideologically moderate participants. These results suggest that current digital twins can capture some degree of relative differences but are unreliable for individual-level prediction and for estimating sample means and variances, underscoring the need for careful validation before use. Our data and code are publicly available for researchers and practitioners interested in optimizing digital twin pipelines.
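The two headline diagnostics above, twin–human correlation per outcome and the twins' reduced response variability, can be sketched in a few lines. This is a hypothetical illustration on invented Likert-scale data, not the paper's actual pipeline; the function names and toy numbers are assumptions for exposition only.

```python
# Hypothetical sketch of twin-vs-human comparison metrics (not the study's code):
# per-outcome Pearson correlation, plus a twin-to-human variance ratio,
# where a ratio below 1 indicates twins under-disperse relative to humans.
from statistics import mean, pvariance

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def variance_ratio(twin, human):
    """Twin-to-human population variance ratio; < 1 means less variable twins."""
    return pvariance(twin) / pvariance(human)

# Invented 1-7 Likert answers for one outcome across ten panelists.
human = [1, 7, 4, 2, 6, 3, 5, 7, 1, 4]
twin  = [3, 5, 4, 3, 5, 4, 4, 5, 3, 4]  # compressed toward the midpoint

r = pearson_r(twin, human)
vr = variance_ratio(twin, human)
```

In this toy case the twins track the rank ordering of people (high r) while their answers are far less spread out (vr well below 1), mirroring the pattern the abstract reports at scale.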