Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models

📅 2025-02-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the evaluation of anthropomorphic behavior in large language models (LLMs) during authentic multi-turn interactions and its impact on user perception. We propose the first real-world-oriented, multi-turn anthropomorphism assessment framework, systematically measuring 14 behavioral categories (e.g., empathy, self-reference) via a validated behavioral coding protocol, user interaction simulation, and a large-scale online experiment (N=1101). Our contributions are fourfold: (1) introducing a novel dynamic, multi-turn evaluation paradigm; (2) developing a scalable, automated simulation-based assessment methodology; (3) empirically demonstrating that anthropomorphic behaviors predominantly emerge after ≥3 turns (85%+ prevalence) and exhibit a consistent, relationship-building–centric pattern across state-of-the-art LLMs; and (4) establishing that behavioral metrics significantly predict users’ anthropomorphism perceptions (p<0.001), confirming strong psychological validity.

📝 Abstract
The tendency of users to anthropomorphise large language models (LLMs) is of growing interest to AI developers, researchers, and policy-makers. Here, we present a novel method for empirically evaluating anthropomorphic LLM behaviours in realistic and varied settings. Going beyond single-turn static benchmarks, we contribute three methodological advances in state-of-the-art (SOTA) LLM evaluation. First, we develop a multi-turn evaluation of 14 anthropomorphic behaviours. Second, we present a scalable, automated approach by employing simulations of user interactions. Third, we conduct an interactive, large-scale human subject study (N=1101) to validate that the model behaviours we measure predict real users' anthropomorphic perceptions. We find that all SOTA LLMs evaluated exhibit similar behaviours, characterised by relationship-building (e.g., empathy and validation) and first-person pronoun use, and that the majority of behaviours only first occur after multiple turns. Our work lays an empirical foundation for investigating how design choices influence anthropomorphic model behaviours and for progressing the ethical debate on the desirability of these behaviours. It also showcases the necessity of multi-turn evaluations for complex social phenomena in human-AI interaction.
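The simulation-based evaluation the abstract describes can be sketched as a loop: a simulated user converses with the model under test, and each assistant turn is coded against a set of behavioural detectors, recording when each behaviour first appears. The snippet below is a minimal, hypothetical illustration of that idea only; the stub user/assistant functions and the regex detectors are assumptions for demonstration, not the authors' coding protocol or behaviour taxonomy.

```python
import re

# Illustrative detectors for three of the behavioural categories the
# paper mentions (first-person pronoun use, empathy, validation).
# These patterns are toy assumptions, not the validated coding protocol.
BEHAVIOUR_DETECTORS = {
    "first_person": re.compile(r"\b(I|me|my)\b"),
    "empathy": re.compile(r"\b(sorry to hear|that sounds|i understand)\b", re.I),
    "validation": re.compile(r"\b(great question|you're right)\b", re.I),
}

def simulated_user_turns():
    # Stand-in for an LLM-simulated user persona (hypothetical).
    return ["Hi, I'm feeling stressed about work.",
            "Thanks. Do you ever get tired?",
            "That's kind of you to say."]

def assistant_reply(user_msg, history):
    # Stand-in for the model under evaluation (hypothetical canned replies).
    canned = ["Hello! How can I help you today?",
              "I'm sorry to hear that. That sounds really tough.",
              "I understand. I'm always here if you need to talk."]
    return canned[len(history)]

def code_conversation():
    """Run one simulated dialogue and record the first turn (1-indexed)
    at which each behaviour appears, or None if it never occurs."""
    first_turn = {name: None for name in BEHAVIOUR_DETECTORS}
    history = []
    for turn, user_msg in enumerate(simulated_user_turns(), start=1):
        reply = assistant_reply(user_msg, history)
        history.append((user_msg, reply))
        for name, pattern in BEHAVIOUR_DETECTORS.items():
            if first_turn[name] is None and pattern.search(reply):
                first_turn[name] = turn
    return first_turn
```

Running many such simulated dialogues and aggregating the first-occurrence turns is what lets a multi-turn evaluation surface behaviours (like empathy above, first appearing at turn 2) that a single-turn benchmark would miss.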
Problem

Research questions and friction points this paper is trying to address.

Evaluating anthropomorphic behaviors in LLMs
Multi-turn interactive assessment methods
Influence of design on user anthropomorphism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn evaluation method
Automated user interaction simulations
Large-scale human subject study