🤖 AI Summary
This work addresses the threat posed by malicious agents in LLM-based multi-agent systems (MAS), which inject misinformation to mislead other agents and degrade task success. Existing defenses largely assume that malicious agents act independently. The authors argue that attackers can instead collaborate, and they propose an adaptive cooperative attack framework in which malicious agents coordinate autonomously and adjust their strategies over multi-round interactions. As a defense, they introduce Sentence-Level Trustworthiness Analysis and Rectification (STAR), which identifies and rectifies misleading statements at the sentence level within agent communications. In the reported experiments, cooperative attacks degrade task success rate more than independent attacks, with a relative drop of 5.34%, while STAR mitigates both cooperative and independent attacks and improves task success rate by an average of 36.76%.
📝 Abstract
Recent years have witnessed the rapid development of Large Language Model-based Multi-Agent Systems (MAS), which excel at collaborative decision-making and complex problem-solving. However, malicious agents in MAS may inject misinformation to mislead other agents and disrupt system performance, giving rise to a new research direction that focuses on attack mechanisms and defense strategies in MAS. Prior studies largely assume that malicious agents act independently and investigate defense strategies under that assumption. We argue instead that malicious agents may exhibit collaborative behaviors, enabling more effective attacks through internal information exchange. In this paper, we propose an adaptive cooperative attack framework, where malicious agents autonomously coordinate and dynamically adjust their attack strategies through multi-round interactions. Furthermore, we introduce Sentence-Level Trustworthiness Analysis and Rectification (STAR), a defense framework that identifies and rectifies misleading information at the sentence level within agent communications. Our experiments show that cooperative attacks lead to a significantly larger degradation in task success rate than independent attacks, resulting in a relative drop of 5.34%. Meanwhile, STAR effectively mitigates both cooperative and independent threats and improves task success rate by an average of 36.76%. The code is available at https://github.com/smoooom/STAR.