🤖 AI Summary
This study investigates whether large language models (LLMs) can authentically replicate human behavioral patterns in conflict resolution as shaped by personality traits. To this end, we propose the first interpretable evaluation framework for aligning AI behavior with human behavior, grounded in the Big Five personality model. We construct a dialogue dataset that pairs specific personality profiles with conflict scenarios and employ quantifiable behavioral metrics to compare LLMs' strategic choices and conflict outcomes against those of humans. Experimental results reveal significant discrepancies between current mainstream LLMs and humans in personality-driven conflict interactions, raising critical concerns about the reliability of these models as behavioral proxies in social applications.
📝 Abstract
Large language models (LLMs) are increasingly used to simulate human behavior in social settings such as legal mediation, negotiation, and dispute resolution. However, it remains unclear whether these simulations reproduce the personality-behavior patterns observed in humans. Human personality shapes how individuals navigate social interactions, including the strategic choices and behaviors they adopt in emotionally charged situations. This raises the question: Can LLMs, when prompted with personality traits, reproduce personality-driven differences in human conflict behavior? To explore this, we introduce an evaluation framework that enables direct comparison of human-human and LLM-LLM behaviors in dispute resolution dialogues with respect to Big Five Inventory (BFI) personality traits. The framework provides a set of interpretable metrics covering strategic behavior and conflict outcomes. We additionally contribute a novel dataset creation methodology for LLM dispute resolution dialogues whose scenarios and personality traits are matched to those of human conversations. Finally, we demonstrate the use of our evaluation framework on three contemporary closed-source LLMs and show significant divergences in how personality manifests in conflict across different LLMs compared with human data, challenging the assumption that personality-prompted agents can serve as reliable behavioral proxies in socially impactful applications. Our work highlights the need for psychological grounding and validation in AI simulations before real-world use.
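One way such a comparison can be operationalized is to treat each group's strategy choices as a discrete distribution and measure the distance between them. The sketch below is purely illustrative (the strategy labels and frequencies are hypothetical, not from the paper's dataset) and uses the Jensen-Shannon divergence as one example of an interpretable metric for comparing human and LLM strategy profiles under a given personality condition.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    Symmetric and bounded in [0, 1]; 0 means the distributions are identical.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return (kl(p, m) + kl(q, m)) / 2

# Hypothetical strategy-choice frequencies over three conflict strategies
# (e.g., compete / compromise / accommodate) for one personality profile.
human = [0.10, 0.55, 0.35]
llm = [0.05, 0.30, 0.65]

print(f"JSD(human, LLM) = {js_divergence(human, llm):.3f}")
```

A small divergence would indicate that the personality-prompted model's strategy mix tracks the human one; large values flag the kind of behavioral mismatch the study reports.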