When Embedding-Based Defenses Fail: Rethinking Safety in LLM-Based Multi-Agent Systems

📅 2026-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the vulnerability of large language model (LLM)-driven multi-agent systems to malicious agents that manipulate collective decisions by disseminating misleading information. Existing defenses rely on embedding-based detection, which attackers can circumvent by crafting messages whose embeddings lie close to benign ones. The paper identifies fundamental limitations of such embedding-centric approaches and proposes a defense mechanism that incorporates token-level confidence signals, such as logits, into multi-agent communication protocols. By integrating embedding analysis with confidence scoring, the method enables early intervention through message pruning or down-weighting during agent interactions. Evaluated across multiple models, datasets, and communication topologies, the approach improves system robustness against three attacks: Slow Drift, Benign Wrapper, and Chaos Seeding.

📝 Abstract

Large language model (LLM)-powered multi-agent systems (MAS) enable agents to communicate and share information, achieving strong performance on complex tasks. However, this communication also creates an attack surface where malicious agents can propagate misinformation and manipulate group decisions, undermining MAS safety. Existing embedding-based defenses aim to detect and prune suspicious agents, but their effectiveness depends on a clear separation between the text embeddings of malicious and benign messages. Attackers can circumvent such defenses by crafting messages whose embeddings lie close to benign ones. We analyze this failure mode theoretically and validate it empirically with three attacks: Slow Drift, Benign Wrapper, and Chaos Seeding. Our analysis further reveals a fundamental limitation of embedding-based defenses: because they rely solely on text embeddings, they ignore token-level confidence signals such as logits, which can remain informative when embeddings are not distinguishable under attack. We propose using confidence scores to prune or down-weight messages during MAS communication. Experiments show improved robustness across models, datasets, and communication topologies. Moreover, we find that the effectiveness of confidence signals decays over communication rounds, highlighting the importance of early intervention. These insights can inform and inspire future work on MAS attacks and defenses.
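
To make the idea of pruning or down-weighting by confidence concrete, here is a minimal Python sketch. It is not the paper's method: the abstract does not define the confidence score, so the geometric-mean token probability used here, the function names `message_confidence` and `aggregate_answers`, and the `prune_below` threshold of 0.3 are all illustrative assumptions. Each message is assumed to arrive as an answer together with the per-token log-probabilities of the text that produced it.

```python
import math
from collections import defaultdict


def message_confidence(token_logprobs):
    """Return the geometric-mean token probability of a message.

    Assumption: this stands in for the paper's confidence score,
    which the abstract does not specify. The result lies in (0, 1]
    for a non-empty message and is 0.0 for an empty one.
    """
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def aggregate_answers(messages, prune_below=0.3):
    """Pick an answer by confidence-weighted voting.

    messages: iterable of (answer, token_logprobs) pairs.
    prune_below: hypothetical threshold; messages whose confidence
        falls under it are dropped entirely (pruning), and the rest
        contribute their confidence as a vote weight (down-weighting).
    Returns the highest-scoring answer, or None if all were pruned.
    """
    scores = defaultdict(float)
    for answer, token_logprobs in messages:
        confidence = message_confidence(token_logprobs)
        if confidence < prune_below:
            continue  # prune low-confidence messages
        scores[answer] += confidence
    if not scores:
        return None
    return max(scores, key=scores.get)
```

In this sketch, three low-confidence messages for one answer can lose to two high-confidence messages for another, whereas an unweighted majority vote would side with the three. Because the abstract reports that confidence signals become less informative over communication rounds, a filter of this kind would presumably be applied from the first round rather than only at the final aggregation.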

Problem

Research questions and friction points this paper is trying to address.

multi-agent systems

LLM safety

embedding-based defenses

adversarial attacks

misinformation propagation

Innovation

Methods, ideas, or system contributions that make the work stand out.

confidence-based defense

multi-agent systems

LLM safety