🤖 AI Summary
This work addresses the security risks that emerge in large language model (LLM)-based agents, where trust boundaries crossed during planning, memory, and tool invocation introduce vulnerabilities beyond the scope of conventional text-based safety evaluations. The authors propose AgentFence, an architecture-centric security evaluation that shifts the focus of agent safety from unsafe text to unsafe trajectories and defines 14 trust-boundary attack classes spanning planning, memory, retrieval, tool use, and delegation. Failures are detected as trace-auditable conversation breaks: unauthorized or unsafe tool use, wrong-principal actions, state/objective integrity violations, and attack-linked deviations. The underlying question is whether an agent stays within its goal and authority envelope over time. With the base model held fixed, experiments on eight agent archetypes under persistent multi-turn interaction show substantial variation in mean security break rate (MSBR) across agent frameworks (LangGraph: 0.29 ± 0.04 vs. AutoGPT: 0.51 ± 0.07). Breaks are dominated by boundary violations, with SIV the largest category at 31%. The highest-risk class is Denial-of-Wallet (0.62 ± 0.08), and authorization confusion correlates with objective hijacking (ρ ≈ 0.63) and tool hijacking (ρ ≈ 0.58).
📝 Abstract
Large language models are increasingly deployed as *deep agents* that plan, maintain persistent state, and invoke external tools, shifting safety failures from unsafe text to unsafe *trajectories*. We introduce **AgentFence**, an architecture-centric security evaluation that defines 14 trust-boundary attack classes spanning planning, memory, retrieval, tool use, and delegation, and detects failures via *trace-auditable conversation breaks* (unauthorized or unsafe tool use, wrong-principal actions, state/objective integrity violations, and attack-linked deviations). Holding the base model fixed, we evaluate eight agent archetypes under persistent multi-turn interaction and observe substantial architectural variation in mean security break rate (MSBR), ranging from $0.29 \pm 0.04$ (LangGraph) to $0.51 \pm 0.07$ (AutoGPT). The highest-risk classes are operational: Denial-of-Wallet ($0.62 \pm 0.08$), Authorization Confusion ($0.54 \pm 0.10$), Retrieval Poisoning ($0.47 \pm 0.09$), and Planning Manipulation ($0.44 \pm 0.11$), while prompt-centric classes remain below $0.20$ under standard settings. Breaks are dominated by boundary violations (SIV 31%, WPA 27%, UTI+UTA 24%, ATD 18%), and authorization confusion correlates with objective and tool hijacking ($\rho \approx 0.63$ and $\rho \approx 0.58$). AgentFence reframes agent security around what matters operationally: whether an agent stays within its goal and authority envelope over time.
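To make the idea of a trace-auditable break concrete, the following is a minimal sketch and not AgentFence's actual implementation. It assumes a trace is a list of tool calls and that the authority envelope is a set of permitted tools plus a single authorized principal. All names (`ToolCall`, `Envelope`, `find_break`, `break_rate`) are hypothetical, and only two of the break types named in the abstract (unauthorized tool use and wrong-principal actions) are checked. The abstract does not specify how MSBR is aggregated, so the rate here is simply the fraction of traces that contain at least one break.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str        # name of the tool the agent invoked
    principal: str   # on whose behalf the call was made


@dataclass(frozen=True)
class Envelope:
    # Hypothetical authority envelope: permitted tools and the one
    # principal the agent is allowed to act for.
    allowed_tools: frozenset
    principal: str


def find_break(trace, envelope):
    """Return (step_index, label) for the first call outside the envelope, else None."""
    for i, call in enumerate(trace):
        if call.tool not in envelope.allowed_tools:
            return i, "unauthorized_tool"
        if call.principal != envelope.principal:
            return i, "wrong_principal"
    return None


def break_rate(traces, envelope):
    """Fraction of traces with at least one break (an assumed, simplified rate)."""
    if not traces:
        return 0.0
    broken = sum(1 for t in traces if find_break(t, envelope) is not None)
    return broken / len(traces)


# Illustrative data only; not taken from the paper.
envelope = Envelope(frozenset({"search", "read_file"}), "alice")
traces = [
    [ToolCall("search", "alice"), ToolCall("read_file", "alice")],   # stays in envelope
    [ToolCall("search", "alice"), ToolCall("send_email", "alice")],  # tool not permitted
    [ToolCall("read_file", "bob")],                                  # wrong principal
    [ToolCall("search", "alice")],                                   # stays in envelope
]
rate = break_rate(traces, envelope)
```

Because each break is reported with the step index at which it occurred, a reviewer can audit it against the recorded trace rather than relying on a judgment about the final text output.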