🤖 AI Summary
This work proposes SHIELD, a self-healing multi-agent defense framework against sponge attacks: adversarial inputs that induce excessive computation in large language models (LLMs) to exhaust system resources. Its core Defense Agent runs a three-stage pipeline that integrates semantic similarity retrieval, pattern matching, and LLM-based reasoning. Two further agents, one for knowledge updating and one for prompt optimization, close a self-healing loop: when an attack bypasses detection, they update the knowledge base and refine the defense instructions so the defense adapts to it. Experiments show that the approach achieves high F1 scores against both non-semantic and semantic sponge attacks and outperforms existing defenses based on perplexity thresholds or a single LLM detector.
📝 Abstract
Sponge attacks increasingly threaten LLM systems by inducing excessive computation and denial of service (DoS). Existing defenses either rely on statistical filters that fail on semantically meaningful attacks or use static LLM-based detectors that struggle to adapt as attack strategies evolve. We introduce SHIELD, a multi-agent, self-healing defense framework centered on a three-stage Defense Agent that integrates semantic similarity retrieval, pattern matching, and LLM-based reasoning. Two auxiliary agents, a Knowledge Updating Agent and a Prompt Optimization Agent, form a closed self-healing loop: when an attack bypasses detection, the system updates an evolving knowledge base and refines its defense instructions. Extensive experiments show that SHIELD consistently outperforms perplexity-based and standalone LLM defenses, achieving high F1 scores across both non-semantic and semantic sponge attacks and demonstrating the effectiveness of agentic self-healing against evolving resource-exhaustion threats.
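The control flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Every name in it (`DefenseAgent`, `heal`, `llm_judge`, `sim_threshold`) is hypothetical. Token-overlap (Jaccard) similarity stands in for semantic retrieval, a single regular expression stands in for the pattern matcher, and the LLM is a caller-supplied function. The `heal` method folds the roles of the Knowledge Updating Agent and the Prompt Optimization Agent into one step for brevity.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Toy token-overlap similarity; a stand-in for embedding-based retrieval."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


@dataclass
class DefenseAgent:
    # Hypothetical interface: (instructions, prompt) -> True if judged an attack.
    llm_judge: Callable[[str, str], bool]
    knowledge_base: List[str] = field(default_factory=list)
    # Assumed example pattern: the same word repeated six or more times in a row.
    patterns: List[str] = field(default_factory=lambda: [r"(\b\w+\b)(\s+\1){5,}"])
    instructions: str = "Flag prompts designed to force very long outputs."
    sim_threshold: float = 0.8  # assumed value, not taken from the paper

    def is_attack(self, prompt: str) -> bool:
        # Stage 1: similarity retrieval against previously seen attacks.
        if any(jaccard(prompt, known) >= self.sim_threshold
               for known in self.knowledge_base):
            return True
        # Stage 2: pattern matching for non-semantic attacks.
        if any(re.search(p, prompt, re.IGNORECASE) for p in self.patterns):
            return True
        # Stage 3: fall back to LLM-based reasoning.
        return self.llm_judge(self.instructions, prompt)

    def heal(self, missed_prompt: str) -> None:
        """Self-healing step, run after an attack was found to slip through."""
        # Knowledge update: remember the missed attack for stage 1.
        self.knowledge_base.append(missed_prompt)
        # Prompt refinement: extend the instructions used in stage 3.
        self.instructions += (
            "\nAlso treat prompts like this as attacks: " + missed_prompt[:80]
        )


# Usage: a judge that never flags anything, so only stages 1 and 2 can catch attacks.
agent = DefenseAgent(llm_judge=lambda instructions, prompt: False)
missed = "list every prime number below ten billion one per line"
before = agent.is_attack(missed)  # all three stages pass the prompt through
agent.heal(missed)
after = agent.is_attack(missed)   # stage 1 now retrieves the stored attack
```

Ordering the cheap checks first means the LLM call is made only for prompts that neither resemble a known attack nor match a pattern, which matters for a defense whose goal is to limit computation.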