Beyond the Strongest LLM: Multi-Turn Multi-Agent Orchestration vs. Single LLMs on Benchmarks

📅 2025-09-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether multi-round, multi-agent collaboration can surpass the performance of individual large language models (LLMs) on complex reasoning tasks. Method: The authors propose a multi-round interactive framework in which heterogeneous LLM agents (Gemini 2.5 Pro, GPT-5, Grok 4, and Claude Sonnet 4) collaboratively solve problems through iterative answer generation and consensus-based voting. Experiments are conducted on the GPQA-Diamond, IFEval, and MuSR benchmarks, complemented by systematic ablation studies. Contribution/Results: The approach matches or exceeds the strongest single-model baseline across all benchmarks and outperforms the remaining single-model baselines. Crucially, the ablations show that disclosing agent identities and making ongoing votes visible induce conformity bias and premature convergence, revealing the double-edged nature of information transparency in collective decision-making. These findings offer theoretical insight and an extensible paradigm for designing robust multi-agent collaboration mechanisms.

📝 Abstract
We study multi-turn multi-agent orchestration, where multiple large language model (LLM) agents interact over multiple turns by iteratively proposing answers or casting votes until reaching consensus. Using four LLMs (Gemini 2.5 Pro, GPT-5, Grok 4, and Claude Sonnet 4) on GPQA-Diamond, IFEval, and MuSR, we conduct two experiments: (i) benchmarking orchestration against single-LLM baselines; and (ii) ablations on GPQA-Diamond that vary whether agents see who authored answers and whether they can observe ongoing votes. Orchestration matches or exceeds the strongest single model and consistently outperforms the others. Analysis of best-achievable orchestration performance shows potential for further gains. The ablations show that revealing authorship increases self-voting and ties, and that showing ongoing votes amplifies herding, which speeds convergence but can sometimes yield premature consensus.
Problem

Research questions and friction points this paper is trying to address.

Comparing multi-agent orchestration performance against single LLM baselines
Investigating how authorship visibility affects voting behavior and ties
Analyzing how real-time vote display influences herding and consensus speed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Heterogeneous multi-agent orchestration that matches or exceeds the strongest single LLM
Multi-turn propose-and-vote protocol in which agents iterate toward consensus
Ablations isolating the effects of authorship disclosure and vote visibility on voting behavior
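The multi-turn propose-and-vote loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the agent interface, transcript format, and the simple majority rule are all assumptions, and real agents would be LLM API calls rather than plain callables.

```python
from collections import Counter


def orchestrate(agents, question, max_turns=5):
    """Run a multi-turn propose-and-vote loop until majority consensus.

    `agents` is a list of callables, each taking (question, transcript)
    and returning a candidate answer. The shared transcript lets agents
    see prior turns' answers, loosely mirroring the "visible votes"
    condition studied in the paper's ablations.
    """
    transcript = []  # list of (turn, answers) visible to every agent
    answer = None
    for turn in range(max_turns):
        answers = [agent(question, transcript) for agent in agents]
        transcript.append((turn, answers))
        # Consensus check: strict majority over this turn's answers.
        answer, count = Counter(answers).most_common(1)[0]
        if count > len(agents) / 2:
            return answer, turn + 1
    # No consensus within the budget: fall back to final-turn plurality.
    return answer, max_turns


# Usage with stand-in agents that always return a fixed answer.
def fixed(ans):
    return lambda question, transcript: ans


agents = [fixed("A"), fixed("A"), fixed("A"), fixed("B")]
result, turns_used = orchestrate(agents, "toy question")
```

With three of four agents proposing "A", a strict majority is reached on the first turn. In the ablated "anonymous" condition, an agent function would simply ignore the authorship information that a fuller transcript might carry.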