🤖 AI Summary
This paper addresses the lack of methodological rigor and insufficient fidelity in large language model (LLM)-generated personas, arguing for a verifiable, auditable scientific framework for persona generation. Through systematic comparison of over one million synthetically generated personas against high-quality real-world population surveys (ANES/CCES), the authors uncover substantial systematic biases in existing heuristic generation approaches, particularly along politically and socially salient dimensions, which undermine the reliability of downstream applications such as election forecasting. Methodologically, the framework integrates LLM-based persona synthesis, multi-dimensional behavioral consistency evaluation, and alignment with empirically grounded benchmarks, while quantifying sources of bias. Key contributions include: (1) establishing empirical rigor as a foundational requirement for persona generation; (2) releasing a publicly available, million-scale synthetic persona dataset; and (3) providing a new paradigm and infrastructure for trustworthy, reproducible social simulation research.
📝 Abstract
The use of large language models (LLMs) to simulate human behavior has gained significant attention, particularly through personas that approximate individual characteristics. Persona-based simulations hold promise for transforming disciplines that rely on population-level feedback, including social science, economic analysis, marketing research, and business operations. Traditional methods of collecting realistic persona data face significant obstacles: they are prohibitively expensive, logistically constrained by privacy requirements, and often fail to capture multi-dimensional attributes, particularly subjective qualities. Consequently, synthetic persona generation with LLMs offers a scalable, cost-effective alternative. However, current approaches rely on ad hoc and heuristic generation techniques that do not guarantee methodological rigor or simulation precision, resulting in systematic biases in downstream tasks. Through extensive large-scale experiments, including presidential election forecasts and general opinion surveys of the U.S. population, we reveal that these biases can lead to significant deviations from real-world outcomes. Our findings underscore the need to develop a rigorous science of persona generation, and we outline the methodological innovations, organizational and institutional support, and empirical foundations required to enhance the reliability and scalability of LLM-driven persona simulations. To support further research and development in this area, we have open-sourced approximately one million generated personas, available for public access and analysis at https://huggingface.co/datasets/Tianyi-Lab/Personas.