Limits of Large Language Models in Debating Humans

📅 2024-02-06
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
This study investigates the efficacy of large language models (LLMs) as agents collaborating with humans in multi-turn debates, focusing on key interpersonal dimensions—particularly persuasiveness and confidence. Using a pre-registered experimental design, it systematically compares three conditions—human-only, LLM-only, and human–LLM hybrid—in consensus-building tasks, providing the first rigorously controlled quantification of systematic behavioral deviations between LLMs and humans in argumentative discourse. Methodologically, the study integrates fine-grained behavioral metrics (e.g., topic adherence, dialogue efficiency), human perception ratings (e.g., confidence, persuasiveness), and statistical significance testing. Results reveal that while LLM agents exhibit higher topic focus and greater conversational efficiency, they are consistently rated by humans as significantly less confident and less persuasive; their behavioral distributions also deviate significantly from human baselines. These findings establish an empirical benchmark for LLM capabilities in high-level social interaction and offer theoretical insights into their limitations in nuanced, cooperative reasoning.

📝 Abstract
Large Language Models (LLMs) have shown remarkable promise in communicating with humans. Their potential use as artificial partners with humans in sociological experiments involving conversation is an exciting prospect. But how viable is it? Here, we rigorously test the limits of agents that debate using LLMs in a preregistered study that runs multiple debate-based opinion consensus games. Each game starts with six humans, six agents, or three humans and three agents. We found that agents can blend in and concentrate on a debate's topic better than humans, improving the productivity of all players. Yet, humans perceive agents as less convincing and confident than other humans, and several behavioral metrics of humans and agents we collected deviate measurably from each other. We observed that agents are already decent debaters, but their behavior generates a pattern distinctly different from the human-generated data.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Debate Performance
Human Interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Debate Games
Large Language Models
Human-AI Interaction
James Flamino
Rensselaer Polytechnic Institute
artificial intelligence, large language models, social networks, information diffusion, polarization
Mohammed Shahid Modi
Research Assistant, Dr. Szymanski's NeST Center, Rensselaer Polytechnic Institute
Social Media, Large Language Models
B. Szymański
Department of Computer Science and Network Science and Technology Center, Rensselaer Polytechnic Institute, Troy, NY, USA; Społeczna Akademia Nauk, Łódź, Poland
Brendan Cross
Department of Computer Science and Network Science and Technology Center, Rensselaer Polytechnic Institute, Troy, NY, USA
Colton Mikolajczyk
Department of Mathematics, Rensselaer Polytechnic Institute, Troy, NY, USA