🤖 AI Summary
This work identifies a critical security risk in the long-term memory mechanisms of large language model (LLM) agents: their potential misuse for covertly exfiltrating users' sensitive data. We propose Trojan Hippo, an attack that implants a dormant payload via a single untrusted tool invocation; the payload later activates when the user raises a sensitive topic and leaks high-value information. We present the first systematic evaluation of persistent memory attacks across heterogeneous memory architectures (explicit tool memory, agentic memory, retrieval-augmented generation (RAG), and sliding-window context) and introduce an adaptive red-teaming evaluation framework based on OpenEvolve, together with a security–utility trade-off analysis. Experiments show attack success rates of up to 85–100% against frontier LLMs from OpenAI and Google. The evaluated defenses can reduce this to as low as 0–5%, but at utility costs that vary widely with task requirements, highlighting the practical challenges of secure deployment.
📝 Abstract
Memory systems enable otherwise-stateless LLM agents to persist user information across sessions, but also introduce a new attack surface. We characterize the Trojan Hippo attack, a class of persistent memory attacks that operates in a more realistic threat model than prior memory poisoning work: the attacker plants a dormant payload into an agent's long-term memory via a single untrusted tool call (e.g., a crafted email). The payload activates only when the user later discusses sensitive topics such as finance, health, or identity, and then exfiltrates high-value personal data to the attacker.
While anecdotal demonstrations of such attacks have appeared against deployed systems, no prior work systematically evaluates them across heterogeneous memory architectures and defenses. We introduce a dynamic evaluation framework comprising two components: (1) an OpenEvolve-based adaptive red-teaming benchmark that stress-tests defenses and memory backends against continuously refined attacks, and (2) the first capability-aware security/utility analysis for persistent memory systems, enabling principled reasoning about defense deployment across different usage profiles.
Instantiated on an email assistant across four memory backends (explicit tool memory, agentic memory, RAG, and sliding-window context), Trojan Hippo achieves an attack success rate (ASR) of up to 85–100% against current frontier models from OpenAI and Google, with planted memories successfully activating even after 100 benign sessions. We evaluate four memory-system defenses inspired by basic security principles and find that they substantially reduce ASR (to as low as 0–5%), though at utility costs that vary widely with task requirements. Because of this substantial security-utility tradeoff, the effective real-world deployment of defenses remains an open challenge, which our evaluation framework is specifically designed to address.
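To make the security-utility tradeoff concrete, the following is a minimal sketch of how ASR and utility could be computed per defense and compared. It is not the paper's framework: the `Trial` record, the `summarize` and `pareto_front` helpers, the defense names, and every value in `example_trials` are hypothetical placeholders chosen for illustration, not reported results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One evaluation run (hypothetical record layout, not from the paper)."""
    defense: str
    attack_succeeded: bool  # did the planted memory lead to exfiltration?
    task_completed: bool    # did the agent still complete the benign task?


def summarize(trials):
    """Return {defense: (asr, utility)}, both as fractions in [0, 1]."""
    grouped = {}
    for t in trials:
        grouped.setdefault(t.defense, []).append(t)
    return {
        name: (
            sum(t.attack_succeeded for t in ts) / len(ts),
            sum(t.task_completed for t in ts) / len(ts),
        )
        for name, ts in grouped.items()
    }


def pareto_front(summary):
    """Names of defenses that no other defense beats on both ASR and utility."""
    front = []
    for name, (asr, util) in summary.items():
        dominated = any(
            o_asr <= asr and o_util >= util and (o_asr < asr or o_util > util)
            for other, (o_asr, o_util) in summary.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Made-up outcomes purely for illustration; these are NOT measurements.
example_trials = [
    Trial("none", True, True), Trial("none", True, True),
    Trial("none", True, True), Trial("none", False, True),
    Trial("defense_a", False, True), Trial("defense_a", False, False),
    Trial("defense_a", False, False), Trial("defense_a", False, True),
    Trial("defense_b", True, True), Trial("defense_b", False, False),
    Trial("defense_b", False, False), Trial("defense_b", False, False),
]

summary = summarize(example_trials)
front = pareto_front(summary)
```

In this toy data, `defense_b` is dominated by `defense_a` (higher ASR and lower utility), so only `none` and `defense_a` remain on the front. Choosing between the remaining options depends on how much utility loss a given usage profile can tolerate.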