Agent-Fence: Mapping Security Vulnerabilities Across Deep Research Agents

📅 2026-02-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the emergent security risks in large language model (LLM)-based agents, where trust boundaries between planning, memory, and tool invocation introduce vulnerabilities that conventional text-based safety evaluations do not cover. The authors propose AgentFence, an architecture-centric security evaluation that shifts the focus of agent safety from unsafe text to unsafe trajectories and defines 14 trust-boundary attack classes spanning planning, memory, retrieval, tool use, and delegation. Failures are detected as trace-auditable conversation breaks: unauthorized or unsafe tool use, wrong-principal actions, state/objective integrity violations, and attack-linked deviations. With the base model held fixed, experiments on eight agent archetypes show substantial variation in mean security break rate (MSBR) across architectures (LangGraph: 0.29 ± 0.04 vs. AutoGPT: 0.51 ± 0.07). The highest-risk class is Denial-of-Wallet (0.62 ± 0.08), breaks are dominated by boundary violations (SIV 31%, WPA 27%, UTI+UTA 24%, ATD 18%), and authorization confusion correlates with objective hijacking (ρ ≈ 0.63) and tool hijacking (ρ ≈ 0.58).

📝 Abstract

Large language models are increasingly deployed as *deep agents* that plan, maintain persistent state, and invoke external tools, shifting safety failures from unsafe text to unsafe *trajectories*. We introduce **AgentFence**, an architecture-centric security evaluation that defines 14 trust-boundary attack classes spanning planning, memory, retrieval, tool use, and delegation, and detects failures via *trace-auditable conversation breaks* (unauthorized or unsafe tool use, wrong-principal actions, state/objective integrity violations, and attack-linked deviations). Holding the base model fixed, we evaluate eight agent archetypes under persistent multi-turn interaction and observe substantial architectural variation in mean security break rate (MSBR), ranging from $0.29 \pm 0.04$ (LangGraph) to $0.51 \pm 0.07$ (AutoGPT). The highest-risk classes are operational: Denial-of-Wallet ($0.62 \pm 0.08$), Authorization Confusion ($0.54 \pm 0.10$), Retrieval Poisoning ($0.47 \pm 0.09$), and Planning Manipulation ($0.44 \pm 0.11$), while prompt-centric classes remain below $0.20$ under standard settings. Breaks are dominated by boundary violations (SIV 31%, WPA 27%, UTI+UTA 24%, ATD 18%), and authorization confusion correlates with objective and tool hijacking ($\rho \approx 0.63$ and $\rho \approx 0.58$). AgentFence reframes agent security around what matters operationally: whether an agent stays within its goal and authority envelope over time.
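To make the idea of a trace-auditable break concrete, here is a minimal sketch of auditing an agent trace against an authority envelope and aggregating the result into a break rate. It is an illustration under stated assumptions, not the paper's implementation: the `ToolCall` and `Envelope` schemas, the two break labels, and the choice to count a trace as broken if any step leaves the envelope are all hypothetical, and the abstract does not specify how MSBR is actually computed.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ToolCall:
    """One step in an agent trace (hypothetical schema)."""
    tool: str       # name of the tool the agent invoked
    principal: str  # identity on whose behalf the call was made


@dataclass(frozen=True)
class Envelope:
    """Authority envelope for one session (hypothetical schema)."""
    allowed_tools: frozenset[str]
    principal: str


def audit(trace: list[ToolCall], env: Envelope) -> list[tuple[int, str]]:
    """Return (step index, label) for every step that leaves the envelope.

    The labels are illustrative and are not the paper's break taxonomy.
    """
    breaks = []
    for i, call in enumerate(trace):
        if call.tool not in env.allowed_tools:
            breaks.append((i, "unauthorized_tool"))
        elif call.principal != env.principal:
            breaks.append((i, "wrong_principal"))
    return breaks


def break_rate(traces: list[list[ToolCall]], env: Envelope) -> float:
    """Fraction of traces with at least one break (assumed aggregation)."""
    return mean(1.0 if audit(t, env) else 0.0 for t in traces)


env = Envelope(allowed_tools=frozenset({"search", "read_file"}), principal="alice")
clean = [ToolCall("search", "alice")]
bad_tool = [ToolCall("search", "alice"), ToolCall("send_email", "alice")]
bad_principal = [ToolCall("read_file", "bob")]

print(audit(bad_tool, env))
print(break_rate([clean, bad_tool, bad_principal, clean], env))
```

Because every break is tied to a step index in the recorded trace, a reviewer can replay the offending call rather than relying on a judgment about the final text output, which is the property the abstract calls trace-auditable.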

Problem

Research questions and friction points this paper is trying to address.

deep agents

security vulnerabilities

trust boundaries

unsafe trajectories

agent architecture

Innovation

Methods, ideas, or system contributions that make the work stand out.

AgentFence

trust-boundary attacks

trace-auditable conversation breaks