🤖 AI Summary
Can collaborative orchestration of multiple open-source large language models (LLMs) systematically outperform state-of-the-art closed-source models?
Method: This paper introduces SMACS, a scalable multi-agent collaboration system featuring a novel “retrieval-based prior selection + exploration-exploitation-driven posterior enhancement” mechanism. SMACS dynamically selects optimal models and generates high-quality, diverse outputs by integrating retrieval-augmented model selection, agent scoring, prior pruning, and hybrid posterior scoring across 15 open-source LLMs.
Contribution/Results: On eight mainstream benchmarks, SMACS significantly surpasses leading 2025 closed-source models, achieving +12.73% over Claude-3.7-Sonnet and +5.36% over GPT-4.1, while establishing a new average performance ceiling across both open- and closed-source models. This work provides the first systematic empirical validation that coordinated open-source LLMs can collectively transcend the fundamental limitations of individual models.
📝 Abstract
This paper aims to demonstrate the potential and strengths of open-source collectives. This leads to a promising question: can we harness multiple open-source LLMs to match or even beat closed-source LLMs? To answer this, we propose SMACS, a high-performance, scalable multi-agent collaboration system (MACS) framework. Specifically, to support continuous integration of new LLMs and generalization to diverse questions, we first propose Retrieval-based Prior Selection (RPS), which assigns a proxy performance score to each LLM to select the Top-k LLMs at the instance level for any given question. We then propose Exploration-Exploitation-Driven Posterior Enhancement (EPE), which encourages diverse responses through prior dropping and selects the highest-quality response via a hybrid posterior score. Experiments on eight mainstream benchmarks validate the effectiveness of SMACS: by integrating fifteen open-source LLMs, SMACS outperforms leading closed-source LLMs in 2025, e.g., Claude-3.7-Sonnet (+12.73%), GPT-4.1 (+5.36%), and GPT-o3-mini (+5.28%), across multiple tasks. Remarkably, it even exceeds the average of the best per-dataset results from both open-source LLMs (+2.86%) and closed-source LLMs (+2.04%), pushing the upper bound of intelligence. Code will be released at https://github.com/magent4aci/SMACS.
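The two-stage pipeline described above (RPS selects Top-k LLMs per question via proxy scores; EPE then picks the best response via a hybrid posterior) could be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the function names, the similarity-weighted proxy score, and the agreement-plus-quality posterior are all assumptions on my part.

```python
import math
from collections import defaultdict

def retrieval_prior_selection(question_emb, bank, k=3):
    """Hypothetical RPS: score each LLM by its historical correctness on
    retrieved similar questions, then keep the top-k LLMs.
    `bank` maps llm_name -> list of (embedding, correctness) records."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    scores = {}
    for llm, records in bank.items():
        # similarity-weighted proxy performance score for this question
        weighted = [(cosine(question_emb, emb), c) for emb, c in records]
        total = sum(w for w, _ in weighted)
        scores[llm] = sum(w * c for w, c in weighted) / total if total else 0.0
    return sorted(scores, key=scores.get, reverse=True)[:k]

def posterior_enhancement(responses, quality, alpha=0.5):
    """Hypothetical EPE posterior: blend agreement frequency across the
    selected LLMs' responses with an external quality score, and return
    the response with the highest hybrid score."""
    counts = defaultdict(int)
    for r in responses:
        counts[r] += 1
    n = len(responses)
    return max(responses,
               key=lambda r: alpha * counts[r] / n + (1 - alpha) * quality[r])
```

In this sketch, prior dropping (the exploration side of EPE) would amount to randomly omitting some retrieved priors before generation so the selected LLMs produce more diverse candidate responses; only the posterior-scoring side is shown here.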