Evaluating LLM-Simulated Conversations in Modeling Inconsistent and Uncollaborative Behaviors in Human Social Interaction

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation of large language models (LLMs) in authentically simulating human-like conversational behaviors, particularly inconsistent and uncollaborative phenomena such as misunderstandings and interruptions. To this end, the authors propose CoCoEval, an evaluation framework that detects ten fine-grained types of inconsistent and uncollaborative behavior at the turn level using an LLM-as-a-Judge, and apply it across diverse dialogue settings, including academic discussions, business meetings, government proceedings, and debates. The study reveals that off-the-shelf LLMs produce these behaviors far less often than humans do, that prompt engineering offers no reliable control over them, and that supervised fine-tuning often leads to over-generation of a narrow set of behaviors such as repetition. These findings highlight fundamental challenges in modeling and controlling complex social interaction dynamics with current LLMs.

📝 Abstract
Simulating human conversations using large language models (LLMs) has emerged as a scalable methodology for modeling human social interaction. However, simulating human conversations is challenging because they inherently involve inconsistent and uncollaborative behaviors, such as misunderstandings and interruptions. Analysis comparing inconsistent and uncollaborative behaviors in human- and LLM-generated conversations remains limited, although reproducing these behaviors is integral to simulating human-like and complex social interaction. In this work, we introduce CoCoEval, an evaluation framework that analyzes LLM-simulated conversations by detecting 10 types of inconsistent and uncollaborative behaviors at the turn level using an LLM-as-a-Judge. Using CoCoEval, we evaluate GPT-4.1, GPT-5.1, and Claude Opus 4 by comparing the frequencies of detected behaviors in conversations simulated by each model and in human conversations across academic, business, and governmental meetings, as well as debates. Our analysis shows that (1) under vanilla prompting, LLM-simulated conversations exhibit far fewer inconsistent and uncollaborative behaviors than human conversations; (2) prompt engineering does not provide reliable control over these behaviors, as our results show that different prompts lead to their under- or overproduction; and (3) supervised fine-tuning on human conversations can lead LLMs to overproduce a narrow set of behaviors, such as repetition. Our findings highlight the difficulty of simulating human conversations, raising concerns about the use of LLMs as a proxy for human social interaction.
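The abstract describes CoCoEval's core procedure: run an LLM-as-a-Judge over each turn to detect behavior types, then compare per-behavior frequencies between human and simulated conversations. A minimal sketch of that pipeline shape is below; the behavior labels, function names, and the stub judge are all hypothetical illustrations (the paper's ten categories and judging prompts are not given in this abstract), with the judge abstracted as a plain callable so any LLM backend could be plugged in.

```python
from collections import Counter
from typing import Callable, Iterable

# Illustrative labels only; the paper defines 10 fine-grained types
# of inconsistent and uncollaborative behavior not listed here.
BEHAVIORS = ["interruption", "misunderstanding", "repetition"]

def detect_behaviors(
    turns: Iterable[str],
    judge: Callable[[str], list[str]],
) -> Counter:
    """Tally behavior labels the judge assigns to each turn.

    `judge` stands in for an LLM-as-a-Judge call: it maps one turn's
    text to a (possibly empty) list of detected behavior labels.
    """
    counts: Counter = Counter()
    for turn in turns:
        for label in judge(turn):
            if label in BEHAVIORS:
                counts[label] += 1
    return counts

def frequency_gap(
    human_counts: Counter, sim_counts: Counter,
    n_human_turns: int, n_sim_turns: int,
) -> dict[str, float]:
    """Per-behavior, per-turn rate difference (human minus simulated).

    Positive values mean the simulation underproduces the behavior,
    the pattern the paper reports under vanilla prompting.
    """
    return {
        b: human_counts[b] / n_human_turns - sim_counts[b] / n_sim_turns
        for b in BEHAVIORS
    }
```

A trivial keyword-matching stand-in for the judge (e.g. `lambda t: ["interruption"] if t.endswith("--") else []`) is enough to exercise the tallying and gap computation end to end before swapping in a real model call.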
Problem

Research questions and friction points this paper is trying to address.

large language models
human social interaction
inconsistent behaviors
uncollaborative behaviors
conversation simulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-simulated conversations
inconsistent behaviors
uncollaborative behaviors
CoCoEval
LLM-as-a-Judge