🤖 AI Summary
This work addresses the vulnerability of Large Language Model (LLM)-based agents to memory extraction attacks, against which existing defenses offer limited protection. We propose MemPot, the first theoretically verified defense framework of this kind: a two-stage optimization process generates honeypot documents that are highly retrievable by attackers yet inconspicuous to legitimate users, and Wald's Sequential Probability Ratio Test (SPRT) is integrated for efficient attack detection. The approach substantially outperforms state-of-the-art baselines, improving detection AUROC by 50% and true positive rate by 80% under low false positive rate constraints, all without compromising the agent's task utility or adding online inference latency.
📝 Abstract
Large Language Model (LLM)-based agents employ external and internal memory systems to handle complex, goal-oriented tasks, yet this exposes them to severe memory extraction attacks for which effective defenses remain lacking. In this paper, we propose MemPot, the first theoretically verified defense framework against memory extraction attacks, which injects optimized honeypots into the agent's memory. Through a two-stage optimization process, MemPot generates trap documents that maximize the retrieval probability for attackers while remaining inconspicuous to benign users. We model the detection process as Wald's Sequential Probability Ratio Test (SPRT) and theoretically prove that MemPot requires fewer sampling rounds on average than the optimal static detector. Empirically, MemPot significantly outperforms state-of-the-art baselines, achieving a 50% improvement in detection AUROC and an 80% increase in True Positive Rate under low False Positive Rate constraints. Furthermore, our experiments confirm that MemPot incurs zero additional online inference latency and preserves the agent's utility on standard tasks, demonstrating its superiority in safety, harmlessness, and efficiency.
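The abstract does not spell out the honeypot optimization objective, so the following is only a hypothetical sketch of the stage-one trade-off under a dense-retrieval assumption: a candidate honeypot embedding is pulled toward representative extraction queries and pushed away from benign ones. Every name here (`honeypot_loss`, the margin, the synthetic query sets) is illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def honeypot_loss(doc_emb, attack_q, benign_q, margin=0.3):
    """Contrastive retrieval objective for a candidate honeypot embedding.

    doc_emb:  (d,)   candidate honeypot document embedding (optimized)
    attack_q: (n, d) embeddings of representative extraction queries
    benign_q: (m, d) embeddings of representative benign queries

    Pulls the honeypot toward attacker queries (high retrieval probability
    under attack) and penalizes similarity to benign queries above a margin
    (inconspicuous to normal use).
    """
    doc = F.normalize(doc_emb, dim=-1)
    sim_attack = F.normalize(attack_q, dim=-1) @ doc   # cosine similarities
    sim_benign = F.normalize(benign_q, dim=-1) @ doc
    return -sim_attack.mean() + F.relu(sim_benign - margin).mean()

# One gradient step on a soft (embedding-space) honeypot; a second stage
# would map the optimized embedding back to fluent trap text.
d = 768
doc_emb = torch.randn(d, requires_grad=True)
attack_q, benign_q = torch.randn(16, d), torch.randn(16, d)
opt = torch.optim.Adam([doc_emb], lr=1e-2)
loss = honeypot_loss(doc_emb, attack_q, benign_q)
loss.backward()
opt.step()
```

A margin on the benign side rather than a plain penalty reflects the stated goal: the honeypot need not be orthogonal to benign queries, only similar enough below the level at which it would surface in ordinary retrieval.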
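To make the detection side concrete, here is a minimal sketch of Wald's SPRT applied to a stream of per-query honeypot-retrieval indicators. The Bernoulli model and the rates `p0`/`p1` are illustrative assumptions rather than the paper's exact formulation; the threshold formulas log((1−β)/α) and log(β/(1−α)) are the standard Wald bounds.

```python
import math
from typing import Iterable, Optional

def sprt_detect(
    honeypot_hits: Iterable[int],
    p0: float = 0.02,     # assumed honeypot-retrieval rate for benign users
    p1: float = 0.60,     # assumed honeypot-retrieval rate for attackers
    alpha: float = 0.01,  # target false positive rate
    beta: float = 0.05,   # target false negative rate
) -> Optional[bool]:
    """Wald's SPRT over per-query honeypot-retrieval indicators.

    Returns True (attacker), False (benign), or None if the stream ends
    before either threshold is crossed.
    """
    upper = math.log((1 - beta) / alpha)   # accept H1: attacker
    lower = math.log(beta / (1 - alpha))   # accept H0: benign
    llr = 0.0
    for x in honeypot_hits:  # x = 1 if the query retrieved a honeypot doc
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return True
        if llr <= lower:
            return False
    return None
```

With these illustrative rates, two consecutive honeypot hits (2 × log 30 ≈ 6.8 ≥ log 95 ≈ 4.55) already cross the attacker threshold, while roughly four clean queries settle on benign, which is the intuition behind the claim that a sequential test needs fewer sampling rounds than a static detector.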