🤖 AI Summary
This study investigates whether collaborative gains (collective accuracy exceeding the average of individual answers) emerge in human-only, LLM-only, and mixed human-LLM conversations. Comparing pairs and trios of LLMs, trios of humans, and human-LLM pairs on question answering, the authors quantify knowledge synergy through confidence calibration and answer-revision behavior. Configurations that include humans improve in accuracy after discussion, benefiting both stronger and weaker participants, whereas purely LLM-based groups decline, a degradation traced to knowledge homogeneity rather than an inability to collaborate. Crucially, collaborative gains are driven by epistemic diversity, not individual model accuracy, identifying diversity as the primary mechanism of effective knowledge integration. These findings challenge the accuracy-centric paradigm in AI system design and motivate a "diversity-first" framework: cultivating variation across agents, even at some cost to standalone performance, to build AI collaborators capable of robust human-AI co-reasoning.
📝 Abstract
Conversations transform individual knowledge into collective insight, allowing groups of humans, and increasingly groups of artificial intelligence (AI) agents, to collaboratively solve complex problems. Whether interactions between AI agents can replicate the synergy observed in human discussions remains an open question. To investigate this, we systematically compared four conversational configurations: pairs of large language models (LLM-LLM), trios of LLMs, trios of humans, and mixed human-LLM pairs. After agents answered questions individually, they engaged in open-ended discussions and then reconsidered their initial answers. Interactions involving humans consistently led to accuracy improvements after the conversations, benefiting both stronger and weaker participants. By contrast, purely LLM-based pairs and trios exhibited declines in accuracy, demonstrating limited conversational synergy. Analysis of participants' confidence and answer-switching behavior revealed that knowledge diversity is a critical factor enabling collaborative improvement. Crucially, the lack of gains in LLM-LLM interactions did not stem from a fundamental limitation of the models' ability to collaborate, but from highly similar knowledge states that left little room for productive exchange. Our findings argue for a paradigm shift in AI development: rather than optimizing individual models solely for standalone performance, explicitly cultivating diversity across agents, even at the cost of slightly lower individual accuracy, may yield AI collaborators that are more effective in group settings with humans or other AI systems.
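The answer-individually, discuss, then revise protocol, and why homogeneous groups leave "little room for productive exchange", can be illustrated with a toy simulation. This is not the paper's method: the `Agent` class, the confidence-based revision rule, and the example answers are all hypothetical stand-ins for the study's open-ended discussions, used only to show how diversity in answers and confidence creates room for collective accuracy to rise.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    answer: str        # the agent's initial answer (hypothetical)
    confidence: float  # self-reported confidence in [0, 1] (hypothetical)

def revise(group, true_answer):
    """Toy revision rule: each agent adopts the answer of the single
    most confident group member, but only if that member is strictly
    more confident than the agent itself. Returns group accuracy
    before and after revision."""
    best = max(group, key=lambda a: a.confidence)
    revised = [best.answer if best.confidence > a.confidence else a.answer
               for a in group]
    before = sum(a.answer == true_answer for a in group) / len(group)
    after = sum(ans == true_answer for ans in revised) / len(group)
    return before, after

# Diverse group: one confident correct member lifts the others.
diverse = [Agent("Paris", 0.9), Agent("Lyon", 0.4), Agent("Nice", 0.3)]
print(revise(diverse, "Paris"))      # accuracy rises from 1/3 to 1.0

# Homogeneous group: identical wrong answers, nothing to exchange.
same = [Agent("Lyon", 0.8), Agent("Lyon", 0.8), Agent("Lyon", 0.8)]
print(revise(same, "Paris"))         # accuracy stays at 0.0
```

Under this caricatured rule, the homogeneous group cannot improve no matter how well-calibrated each member is, mirroring the paper's point that the failure of LLM-LLM groups reflects overlapping knowledge states rather than an inability to collaborate.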