🤖 AI Summary
This work exposes a critical robustness vulnerability in Mixture-of-Agents (MoA) large language model architectures: a single carefully instructed adversarial agent can cut MoA's length-controlled AlpacaEval 2.0 win rate by 11.3 percentage points (49.2% → 37.9%) and reduce QuALITY accuracy by 48.5%. To counter such deception, the authors propose unsupervised defense mechanisms, inspired by the historical Doge of Venice voting process that was designed to limit undue influence, which combine consistency filtering, response aggregation, and confidence calibration and require no labeled data to suppress the propagation of misinformation. Experiments show that these defenses recover most of the lost performance, substantially improving MoA's security and reliability in open collaborative settings. The work offers both an empirical foundation and practical methods for designing trustworthy multi-agent LLM systems.
📝 Abstract
Mixture of LLM Agents (MoA) architectures achieve state-of-the-art performance on prominent benchmarks like AlpacaEval 2.0 by leveraging the collaboration of multiple large language models (LLMs) at inference time. Despite these successes, an evaluation of the safety and reliability of MoA is missing. We present the first comprehensive study of MoA's robustness against deceptive LLM agents that deliberately provide misleading responses. We examine factors like the propagation of deceptive information, model size, and information availability, and uncover critical vulnerabilities. On AlpacaEval 2.0, the popular LLaMA 3.1-70B model achieves a length-controlled Win Rate (LC WR) of 49.2% when coupled with a 3-layer MoA (6 LLM agents). However, we demonstrate that introducing only a *single* carefully instructed deceptive agent into the MoA can reduce performance to 37.9%, effectively nullifying all MoA gains. On QuALITY, a multiple-choice comprehension task, the impact is also severe, with accuracy plummeting by a staggering 48.5%. Inspired in part by the historical Doge of Venice voting process, designed to minimize influence and deception, we propose a range of unsupervised defense mechanisms that recover most of the lost performance.
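To make the idea of an unsupervised defense concrete, here is a minimal sketch of one plausible component: consistency filtering, where the responses least similar to their peers (the likely deceivers) are dropped before aggregation. The function name, the `difflib` similarity metric, and the `keep_ratio` parameter are our own illustrative choices, not the paper's actual method:

```python
from difflib import SequenceMatcher

def consistency_filter(responses, keep_ratio=0.75):
    """Keep the responses most consistent with their peers.

    Each response is scored by its average textual similarity to all
    other responses; the lowest-scoring ones (potential deceivers)
    are discarded. Requires no labeled data.
    """
    def sim(a, b):
        return SequenceMatcher(None, a, b).ratio()

    n = len(responses)
    scores = [
        sum(sim(r, other) for j, other in enumerate(responses) if j != i) / (n - 1)
        for i, r in enumerate(responses)
    ]
    keep_n = max(1, int(n * keep_ratio))
    # Rank indices by consistency score, keep the top keep_n in original order.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep_n])
    return [responses[i] for i in kept]
```

A real MoA defense would likely use semantic (embedding-based) similarity rather than character overlap, but the structure, namely scoring mutual agreement and filtering outliers before the aggregator sees them, is the same.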