🤖 AI Summary
This work exposes a critical robustness vulnerability in Mixture-of-Agents (MoA) large language model architectures: a single carefully instructed adversarial agent can cut MoA's length-controlled AlpacaEval 2.0 win rate by 11.3 percentage points (49.2% → 37.9%) and reduce QuALITY accuracy by 48.5%. To counter such deception, the authors propose unsupervised defense mechanisms, inspired by the historical Doge of Venice voting process that was designed to limit undue influence, which combine consistency filtering, response aggregation, and confidence calibration and require no labeled data to suppress the propagation of misinformation. Experiments show that these defenses recover most of the lost performance, substantially improving MoA's security and reliability in open collaborative settings. The work offers both an empirical foundation and practical methods for designing trustworthy multi-agent LLM systems.
📝 Abstract
Mixture of LLM Agents (MoA) architectures achieve state-of-the-art performance on prominent benchmarks like AlpacaEval 2.0 by leveraging the collaboration of multiple large language models (LLMs) at inference time. Despite these successes, an evaluation of the safety and reliability of MoA is missing. We present the first comprehensive study of MoA's robustness against deceptive LLM agents that deliberately provide misleading responses. We examine factors like the propagation of deceptive information, model size, and information availability, and uncover critical vulnerabilities. On AlpacaEval 2.0, the popular LLaMA 3.1-70B model achieves a length-controlled Win Rate (LC WR) of 49.2% when coupled with a 3-layer MoA (6 LLM agents). However, we demonstrate that introducing only a *single* carefully instructed deceptive agent into the MoA can reduce performance to 37.9%, effectively nullifying all MoA gains. On QuALITY, a multiple-choice comprehension task, the impact is also severe, with accuracy plummeting by a staggering 48.5%. Inspired in part by the historical Doge of Venice voting process, designed to minimize influence and deception, we propose a range of unsupervised defense mechanisms that recover most of the lost performance.
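To make the idea of an unsupervised defense concrete, here is a minimal sketch of one plausible component: consistency filtering, where the responses least similar to their peers (the likely deceivers) are dropped before aggregation. The function name, the `difflib` similarity metric, and the `keep_ratio` parameter are our own illustrative choices, not the paper's actual method:

```python
from difflib import SequenceMatcher

def consistency_filter(responses, keep_ratio=0.75):
    """Keep the responses most consistent with their peers.

    Each response is scored by its average textual similarity to all
    other responses; the lowest-scoring ones (potential deceivers)
    are discarded. Requires no labeled data.
    """
    def sim(a, b):
        return SequenceMatcher(None, a, b).ratio()

    n = len(responses)
    scores = [
        sum(sim(r, other) for j, other in enumerate(responses) if j != i) / (n - 1)
        for i, r in enumerate(responses)
    ]
    keep_n = max(1, int(n * keep_ratio))
    # Rank indices by consistency score, keep the top keep_n in original order.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep_n])
    return [responses[i] for i in kept]
```

A real MoA defense would likely use semantic (embedding-based) similarity rather than character overlap, but the structure, namely scoring mutual agreement and filtering outliers before the aggregator sees them, is the same.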