AI Summary
Traditional text-level safety mechanisms fail to adequately address security risks arising from LLM agent behaviors. To bridge this gap, this paper proposes GuardAgent, the first dynamic, agent-level safety guard framework. Its core is knowledge-enabled, two-stage LLM reasoning (safety-requirement parsing → plan-to-code mapping) combined with memory-based retrieval of in-context demonstrations, enabling real-time behavioral verification of target agents through lightweight, executable code-based guardrails. The paper introduces this agent-level guarding paradigm along with two purpose-built benchmarks: EICU-AC (access control for healthcare agents) and Mind2Web-SC (safety policies for web agents). Experiments show GuardAgent achieves over 98% and 83% guardrail accuracy on these benchmarks, respectively, effectively suppressing policy violations while maintaining high flexibility, low operational overhead, and strong generalization across diverse agent tasks and environments.
Abstract
The rapid advancement of large language model (LLM) agents has raised new concerns regarding their safety and security, which cannot be addressed by traditional textual-harm-focused LLM guardrails. We propose GuardAgent, the first guardrail agent to protect target agents by dynamically checking whether their actions satisfy given safety guard requests. Specifically, GuardAgent first analyzes a safety guard request to generate a task plan, and then maps this plan into guardrail code for execution. By executing this code, GuardAgent can deterministically enforce the safety guard request and safeguard the target agent. In both steps, an LLM serves as the reasoning component, supplemented by in-context demonstrations retrieved from a memory module that stores experiences from previous tasks. GuardAgent can understand diverse safety guard requests and provide reliable code-based guardrails with high flexibility and low operational overhead. In addition, we propose two novel benchmarks: the EICU-AC benchmark, which assesses access control for healthcare agents, and the Mind2Web-SC benchmark, which evaluates safety policies for web agents. We show that GuardAgent effectively moderates violation actions for different types of agents on these two benchmarks, with over 98% and 83% guardrail accuracy, respectively. Project page: https://guardagent.github.io/
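The two-stage flow described in the abstract (retrieve demonstrations from memory, parse the request into a plan, map the plan to code, then execute the code as a deterministic check) can be sketched as below. This is a minimal illustration, not the paper's implementation: the two LLM calls are stubbed with fixed functions, and `Memory`, `plan_llm`, `codegen_llm`, and `guard` are hypothetical names introduced here.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Stores experiences (request/plan/code) from previous guarding tasks."""
    entries: list = field(default_factory=list)

    def retrieve(self, request: str, k: int = 1):
        # Naive keyword-overlap retrieval; a real system would use embeddings.
        scored = sorted(
            self.entries,
            key=lambda e: len(set(request.split()) & set(e["request"].split())),
            reverse=True,
        )
        return scored[:k]

def plan_llm(request: str, demos) -> list:
    """Stage 1 stand-in: parse the safety guard request into a task plan.
    In GuardAgent this is an LLM call conditioned on retrieved demos."""
    return ["extract the user role from the agent's inputs",
            "check the role against the access-control rule"]

def codegen_llm(plan, demos) -> str:
    """Stage 2 stand-in: map the task plan to executable guardrail code.
    The allowed-role set here is a made-up example rule."""
    return (
        "def guardrail(action):\n"
        "    allowed = {'physician', 'nurse'}\n"
        "    return action.get('role') in allowed\n"
    )

def guard(request: str, action: dict, memory: Memory) -> bool:
    """Run the full pipeline and return True iff the action is permitted."""
    demos = memory.retrieve(request)
    plan = plan_llm(request, demos)
    code = codegen_llm(plan, demos)
    scope = {}
    exec(code, scope)  # code execution makes the check deterministic
    return scope["guardrail"](action)

memory = Memory([{"request": "restrict database access by role"}])
ok = guard("only physicians and nurses may query patient vitals",
           {"role": "physician", "target": "vitals"}, memory)
denied = guard("only physicians and nurses may query patient vitals",
               {"role": "admin", "target": "vitals"}, memory)
```

The key design point the sketch mirrors is that the final allow/deny decision comes from running generated code, not from another LLM judgment, which is what makes the guardrail deterministic once the code is produced.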