AI Summary
Multi-agent large language model (LLM) systems are vulnerable to adversarial or low-performing agents, leading to unreliable outputs. To address this, we propose the first adversarial-robust framework for such systems, modeling collaborative question answering as an iterative game. Our method introduces a history-aware mechanism that adaptively learns agent trustworthiness and performs trust-weighted output aggregation. It integrates four key components: trustworthiness modeling, game-theoretic collaboration, robust aggregation, and adversarial training-based evaluation. Extensive experiments across diverse tasks and settings demonstrate that our framework significantly mitigates malicious interference, improving both accuracy and output stability. Notably, it maintains high performance even under extreme conditions where adversarial agents constitute over 50% of the system, marking the first empirical validation of robustness in majority-adversary regimes.
Abstract
While multi-agent LLM systems show strong capabilities in various domains, they are highly vulnerable to adversarial and low-performing agents. To address this issue, we introduce a general and adversary-resistant multi-agent LLM framework based on credibility scoring. We model the collaborative query-answering process as an iterative game, in which the agents communicate and contribute to a final system output. Our system associates a credibility score with each agent, which is used when aggregating the team's outputs. The credibility scores are learned gradually from each agent's past contributions to query answering. Our experiments across multiple tasks and settings demonstrate our system's effectiveness in mitigating adversarial influence and enhancing the resilience of multi-agent cooperation, even in adversary-majority settings.
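The credibility-weighted aggregation and score-learning loop described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual method: the function names, the exponential-moving-average update rule, and the use of the aggregated answer itself as the feedback signal are all assumptions introduced here for clarity.

```python
from collections import defaultdict

def aggregate(answers, credibility):
    """Pick the answer with the highest total credibility-weighted vote.

    answers:     dict mapping agent id -> that agent's proposed answer
    credibility: dict mapping agent id -> current credibility score
    """
    votes = defaultdict(float)
    for agent, answer in answers.items():
        votes[answer] += credibility[agent]
    return max(votes, key=votes.get)

def update_credibility(credibility, answers, final_answer, lr=0.1):
    """Exponential-moving-average update (an assumed rule, for illustration):
    agents that agreed with the aggregated answer gain credibility,
    agents that disagreed lose it gradually over rounds."""
    for agent, answer in answers.items():
        reward = 1.0 if answer == final_answer else 0.0
        credibility[agent] = (1 - lr) * credibility[agent] + lr * reward
    return credibility

# Hypothetical round: three honest agents answer "X", two adversarial agents "Y".
credibility = {a: 1.0 for a in ["a1", "a2", "a3", "b1", "b2"]}
answers = {"a1": "X", "a2": "X", "a3": "X", "b1": "Y", "b2": "Y"}
for _ in range(10):  # repeated rounds let adversarial credibility decay
    final = aggregate(answers, credibility)
    credibility = update_credibility(credibility, answers, final)
```

After a few rounds the adversarial agents' scores decay toward zero while honest agents' scores persist, so their votes carry progressively less weight in later aggregations. Handling adversary-majority settings, as the paper claims, would require a stronger feedback signal than self-consistency, which this toy sketch does not model.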