🤖 AI Summary
This study investigates whether large language models (LLMs) can reliably generate demographically and attitudinally representative “silicon samples” to substitute for human respondents in social surveys. Method: Leveraging GPT-4, we construct repeated random sampling distributions and systematically quantify output biases across demographic parameters (sex, age, race, education) and attitudinal scales (e.g., political orientation), benchmarking against the 2020 U.S. Census and validated survey data. Contribution/Results: GPT-4 samples approximate census-level sex ratios and mean age but significantly overrepresent Black individuals and highly educated groups. Attitudinal responses exhibit high certainty, no consistent ideological patterning, and a near-normal distributional shape; critically, they lack genuine inter-individual variability, diverging fundamentally from human distributions. We introduce the first sampling-distribution-based framework for evaluating LLM-generated social responses, empirically characterizing systematic biases and providing both methodological foundations and empirical cautions for the circumspect use of LLMs in social science research.
📝 Abstract
Recent discussions about Large Language Models (LLMs) suggest that they have the potential to simulate human responses in social surveys and generate reliable predictions, such as those found in political polls. However, existing findings are highly inconsistent, leaving the population characteristics of LLM-generated data uncertain. In this paper, we employ repeated random sampling to construct sampling distributions that identify the population parameters of silicon samples generated by GPT. Our findings show that GPT's demographic distribution aligns with the 2020 U.S. population in terms of gender and average age. However, GPT significantly overestimates the representation of the Black population and of individuals with higher levels of education, even when it possesses accurate knowledge. Furthermore, GPT's point estimates for attitudinal scores are highly inconsistent and show no clear inclination toward any particular ideology. The sample response distributions exhibit an approximately normal pattern that diverges significantly from those of human respondents. Consistent with previous studies, we find that GPT's answers are more deterministic than those of humans. We conclude by discussing the concerning implications of this biased and deterministic silicon population for making inferences about real-world populations.
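The repeated-random-sampling logic described in the abstract can be sketched as follows. This is an illustrative simulation only: the benchmark share, sample sizes, and the stand-in response generator are hypothetical placeholders, not the paper's actual GPT outputs or 2020 Census figures.

```python
import random
import statistics

# Hypothetical benchmark: a demographic group's true population share
# (stand-in value, NOT an actual 2020 U.S. Census figure).
BENCHMARK_SHARE = 0.50

def draw_silicon_sample(n, p):
    """Simulate one 'silicon sample' of n binary demographic responses.
    In the study proper, each respondent would instead be an
    LLM-generated persona; here a Bernoulli draw stands in."""
    return [1 if random.random() < p else 0 for _ in range(n)]

def sampling_distribution(num_samples, n, p):
    """Repeatedly draw samples and record each sample's proportion,
    yielding an empirical sampling distribution of the estimator."""
    return [statistics.mean(draw_silicon_sample(n, p))
            for _ in range(num_samples)]

random.seed(0)
# Suppose the model overrepresents the group at 55% rather than 50%.
dist = sampling_distribution(num_samples=1000, n=200, p=0.55)

est = statistics.mean(dist)      # center of the sampling distribution
bias = est - BENCHMARK_SHARE     # systematic deviation from the benchmark
spread = statistics.stdev(dist)  # variability across repeated samples

print(f"estimated share: {est:.3f}, "
      f"bias vs. benchmark: {bias:+.3f}, sd: {spread:.3f}")
```

Comparing the center of the sampling distribution to an external benchmark separates systematic bias (a shifted center, as with the overrepresentation findings) from sampling noise (the spread across repeated draws).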