AI Summary
This study addresses "persona collapse" in large language models (LLMs) used for multi-agent simulations: agents assigned distinct personas converge toward homogeneous behavior, undermining population diversity. The work introduces the concept and proposes a quantitative evaluation framework built on three measures: coverage, uniformity, and complexity. Using the BFI-44 personality inventory, moral reasoning tasks, and self-introduction generation, the authors show that persona collapse occurs across ten mainstream LLMs. Behavioral differences among generated agents stem predominantly from coarse-grained demographic stereotypes rather than individualized traits, and the models with the highest per-persona fidelity produce the most stereotyped populations. The authors release their evaluation toolkit and data to support population-level evaluation of LLM-driven agents.
Abstract
Applications based on large language models (LLMs), such as multi-agent simulations, require population diversity among agents. We identify a pervasive failure mode we term \emph{Persona Collapse}: agents each assigned a distinct profile nonetheless converge into a narrow behavioral mode, producing a homogeneous simulated population. To quantify persona collapse, we propose a framework that measures how much of the persona space a population occupies (Coverage), how evenly agents spread across it (Uniformity), and how rich the resulting behavioral patterns are (Complexity). Evaluating ten LLMs on personality simulation (BFI-44), moral reasoning, and self-introduction, we observe persona collapse along two axes: (1) Dimensions: a model can appear diverse on one axis yet structurally degenerate on another, and (2) Domains: the same model may collapse the most in personality yet be the most diverse in moral reasoning. Furthermore, item-level diagnostics reveal that behavioral variation tracks coarse demographic stereotypes rather than the fine-grained individual differences specified in each persona. Counter-intuitively, \textbf{the models achieving the highest per-persona fidelity consistently produce the most stereotyped populations}. We release our toolkit and data to support population-level evaluation of LLMs.
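The abstract names the three measures but does not define them, so the following is only an illustrative sketch and not the authors' toolkit. It assumes each agent is summarized as a vector of trait scores on a 1 to 5 scale (as BFI-44 scale means would be), and it uses stand-in proxies chosen for this example: grid-cell occupancy for Coverage, normalized Shannon entropy of cell counts for Uniformity, and the participation ratio of the covariance matrix for Complexity. All function names and the `bins` parameter are hypothetical.

```python
import math
from collections import Counter


def cell_of(point, lo=1.0, hi=5.0, bins=4):
    """Map a trait vector to a grid cell by binning each coordinate."""
    width = (hi - lo) / bins
    return tuple(min(int((x - lo) / width), bins - 1) for x in point)


def coverage(points, bins=4):
    """Assumed proxy: fraction of grid cells occupied by at least one agent."""
    dims = len(points[0])
    occupied = {cell_of(p, bins=bins) for p in points}
    return len(occupied) / bins ** dims


def uniformity(points, bins=4):
    """Assumed proxy: entropy of cell counts, scaled by its maximum possible value."""
    counts = Counter(cell_of(p, bins=bins) for p in points)
    n = len(points)
    max_cells = min(n, bins ** len(points[0]))
    if max_cells < 2:
        return 0.0
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(max_cells)


def complexity(points):
    """Assumed proxy: participation ratio (tr C)^2 / tr(C^2) of the covariance C.

    Ranges from 1 (variation along a single direction) up to the number of
    dimensions; returns 0.0 when the population has no variance at all.
    """
    n, dims = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dims)]
    cov = [
        [sum((p[i] - means[i]) * (p[j] - means[j]) for p in points) / n
         for j in range(dims)]
        for i in range(dims)
    ]
    trace = sum(cov[i][i] for i in range(dims))
    trace_sq = sum(v * v for row in cov for v in row)  # tr(C^2) for symmetric C
    return 0.0 if trace_sq == 0 else trace ** 2 / trace_sq


# Toy populations with two traits each (made-up values, for illustration only).
collapsed = [(3.0, 3.0)] * 4
diverse = [(1.0, 1.0), (1.0, 5.0), (5.0, 1.0), (5.0, 5.0)]

for name, pop in [("collapsed", collapsed), ("diverse", diverse)]:
    print(name, coverage(pop), uniformity(pop), complexity(pop))
```

On these toy inputs the collapsed population scores at the bottom of all three proxies, while the diverse one occupies more cells, spreads evenly across them, and varies along both trait dimensions. A population can also score well on one proxy and poorly on another, which is the kind of split the abstract describes across dimensions.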