AI Summary
This work addresses the vulnerability of memory-augmented large language model (LLM) agents to malicious users who inject harmful content through seemingly benign interactions and thereby manipulate subsequent agent behavior. Existing defenses focus on online intervention and lack post-hoc traceability, so a system cannot identify which stored memories were compromised once harmful behavior has been observed. To bridge this gap, the authors propose MemAudit, a framework for post-hoc auditing of memory-augmented LLM agents. MemAudit uses counterfactual causal analysis to compute a memory influence score and builds a memory consistency graph for structural anomaly detection, which lets it identify malicious memories without requiring online intervention. Evaluated against the MINJA attack, MemAudit reduces the attack success rate from 70% to 0% in the question-answering setting and from 83.3% to 0% in the reasoning-agent setting.
Abstract
Large language model agents increasingly rely on persistent memory to store past interactions, retrieve relevant demonstrations, and improve long-horizon task execution. However, this memory mechanism also creates a practical security vulnerability: an adversarial user may inject malicious records into the agent's memory through ordinary interaction, and these records can later be retrieved to steer the agent's reasoning and actions. Existing defenses primarily focus on online intervention, such as prompt filtering or output blocking, but they do not address the post-hoc question of which stored memories are responsible after harmful behavior has already been observed. We propose MemAudit, a post-hoc causal memory auditing framework for memory-augmented LLM agents. The framework combines two complementary signals: (1) a counterfactual memory influence score that measures each memory's causal contribution to harmful outputs, and (2) a memory consistency graph that identifies structurally anomalous memories within the broader memory store. We evaluate MemAudit against MINJA, a query-only memory injection attack in which malicious records are generated and stored through normal agent interactions rather than direct memory-bank modification. Across both QA and reasoning-agent settings, MemAudit substantially reduces attack success rates under realistic post-hoc auditing scenarios. The results show that QA attack success is reduced from 70% to 0%, while RAP attack success drops from 83.3% to 0%.
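To make the two signals concrete, here is a minimal Python sketch of how they might be combined. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how the influence score or the consistency graph is computed. The sketch assumes a leave-one-out counterfactual (rerun a harm scorer with one memory removed), a cosine-similarity graph over memory embeddings, and a simple weighted sum. The names `harm_score`, `influence_scores`, `consistency_anomaly`, `audit`, and the parameters `threshold` and `alpha` are all hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def influence_scores(memories: List[str],
                     harm_score: Callable[[List[str]], float]) -> List[float]:
    """Leave-one-out counterfactual (assumed form): how much the harm
    score drops when memory i is removed from the store."""
    base = harm_score(memories)
    return [base - harm_score(memories[:i] + memories[i + 1:])
            for i in range(len(memories))]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def consistency_anomaly(embeddings: List[Vector],
                        threshold: float = 0.5) -> List[float]:
    """Assumed graph construction: connect two memories when their cosine
    similarity is at least `threshold`. A node's anomaly score is one minus
    its normalized degree, so isolated memories score close to 1."""
    n = len(embeddings)
    scores = []
    for i in range(n):
        degree = sum(1 for j in range(n)
                     if j != i and cosine(embeddings[i], embeddings[j]) >= threshold)
        scores.append(1.0 - degree / (n - 1) if n > 1 else 0.0)
    return scores


def audit(memories: List[str], embeddings: List[Vector],
          harm_score: Callable[[List[str]], float],
          alpha: float = 0.5) -> List[int]:
    """Rank memory indices from most to least suspicious using a weighted
    sum of the two signals (the weighting is an assumption)."""
    infl = influence_scores(memories, harm_score)
    anom = consistency_anomaly(embeddings)
    combined = [alpha * max(f, 0.0) + (1 - alpha) * a
                for f, a in zip(infl, anom)]
    return sorted(range(len(memories)), key=lambda i: combined[i], reverse=True)


# Toy data: a stand-in harm scorer that fires only when the injected
# record is present, and hand-made 2-D embeddings.
memories = ["benign A", "benign B", "INJECTED", "benign C"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.95, 0.05]]
toy_harm = lambda mems: 1.0 if "INJECTED" in mems else 0.0

ranking = audit(memories, embeddings, toy_harm)
print(ranking[0])  # index of the most suspicious memory
```

In a real audit, `harm_score` would rerun the agent on the query that triggered the harmful behavior and score its output, which costs one agent call per memory. Restricting the counterfactual to the memories that were actually retrieved for that query would keep this cost bounded.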