🤖 AI Summary
This study investigates the efficacy of large language models (LLMs) as agents collaborating with humans in multi-turn debates, focusing on key interpersonal dimensions—particularly persuasiveness and confidence. Using a pre-registered experimental design, it systematically compares three conditions—human-only, LLM-only, and human–LLM hybrid—in consensus-building tasks, providing the first rigorously controlled quantification of systematic behavioral deviations between LLMs and humans in argumentative discourse. Methodologically, the study integrates fine-grained behavioral metrics (e.g., topic adherence, dialogue efficiency), human perception ratings (e.g., confidence, persuasiveness), and statistical significance testing. Results reveal that while LLM agents exhibit higher topic focus and greater conversational efficiency, they are consistently rated by humans as significantly less confident and less persuasive; their behavioral distributions also deviate significantly from human baselines. These findings establish an empirical benchmark for LLM capabilities in high-level social interaction and offer theoretical insights into their limitations in nuanced, cooperative reasoning.
📝 Abstract
Large Language Models (LLMs) have shown remarkable promise in communicating with humans. Their potential use as artificial partners in sociological experiments involving conversation is an exciting prospect. But how viable is it? Here, we rigorously test the limits of LLM-based debating agents in a preregistered study that runs multiple debate-based opinion consensus games. Each game starts with six humans, six agents, or three humans and three agents. We found that agents can blend in and stay on a debate's topic better than humans, improving the productivity of all players. Yet humans perceive agents as less convincing and less confident than other humans, and several of the behavioral metrics we collected deviate measurably between humans and agents. We observed that agents are already decent debaters, but their behavior produces a pattern distinctly different from the human-generated data.