SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety

📅 2026-05-07
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the challenge that large language model agents often suffer degraded utility on benign tasks due to overly conservative safety mechanisms, while still needing robust defenses against malicious attacks. To resolve this trade-off, the authors propose a training-free, plug-and-play safety guardrail framework that integrates context-aware dynamic defense rules with a localized hierarchical memory system. A key innovation is an information entropy–based self-evolution mechanism that dynamically splits and merges memory nodes to precisely delineate safety decision boundaries. Experimental results on GPT-4o demonstrate that the approach maintains a rejection rate exceeding 93% on harmful requests while boosting utility on ambiguous benign tasks to 63.6%, significantly outperforming existing methods.
📝 Abstract
With the rapid evolution of foundation models, Large Language Model (LLM) agents have demonstrated increasingly powerful tool-use capabilities. However, this proficiency introduces significant security risks, as malicious actors can manipulate agents into executing tools to generate harmful content. While existing defensive mechanisms are effective, they frequently suffer from the over-refusal problem, where increased safety strictness compromises the agent's utility on benign tasks. To mitigate this trade-off, we propose SafeHarbor, a novel framework designed to establish precise decision boundaries for LLM agents. Unlike static guidelines, SafeHarbor extracts context-aware defense rules through enhanced adversarial generation. We design a local hierarchical memory system for dynamic rule injection, offering a training-free, efficient, and plug-and-play solution. Furthermore, we introduce an information entropy-based self-evolution mechanism that continuously optimizes the memory structure through dynamic node splitting and merging. Extensive experiments demonstrate that SafeHarbor achieves state-of-the-art performance on both ambiguous benign tasks and explicit malicious attacks, notably attaining a peak benign utility of 63.6% on GPT-4o while maintaining a robust refusal rate exceeding 93% against harmful requests. The source code is publicly available at https://github.com/ljj-cyber/SafeHarbor.
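The abstract does not say what the entropy is computed over or how split and merge decisions are triggered. The sketch below is therefore only an illustration of the general idea, not the authors' implementation. It assumes each memory node stores "allow"/"refuse" labels of past decisions, that a node with a mixed (high-entropy) label distribution is a candidate for splitting, and that two sibling nodes whose combined labels are nearly uniform (low entropy) are candidates for merging. The `MemoryNode` class, the function names, and both thresholds are hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    """Hypothetical memory node: a topic plus labels of past decisions."""
    topic: str
    labels: list = field(default_factory=list)    # e.g. "allow" / "refuse"
    children: list = field(default_factory=list)  # finer-grained sub-nodes


def label_entropy(labels):
    """Shannon entropy, in bits, of the label distribution."""
    if not labels:
        return 0.0
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def should_split(node, threshold=0.9):
    """Assumed rule: split a node whose decisions are too mixed."""
    return label_entropy(node.labels) > threshold


def should_merge(a, b, threshold=0.3):
    """Assumed rule: merge siblings whose pooled decisions are nearly uniform."""
    return label_entropy(a.labels + b.labels) < threshold
```

Under these assumptions, a node holding an even mix of allowed and refused cases has entropy of 1 bit and would be split, while two nodes that both contain only refusals have pooled entropy of 0 and would be merged.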
Problem

Research questions and friction points this paper is trying to address.

LLM agent safety
over-refusal
tool-use security
harmful content generation
safety-utility trade-off
Innovation

Methods, ideas, or system contributions that make the work stand out.

memory-augmented guardrail
context-aware defense rules
hierarchical memory
self-evolution mechanism
over-refusal mitigation
Authors
Zhe Liu
School of Cyber Science and Technology, Beihang University, Beijing, China
Zonghao Ying
SKLCCSE, BUAA
Trustworthy AI
Wenxin Zhang
University of Chinese Academy of Sciences
Deep Learning, Self-supervised Learning, Graph neural networks
Quanchen Zou
360 AI Security Lab, Beijing, China
Deyue Zhang
360 AI Security Lab, Beijing, China
Dongdong Yang
360 AI Security Lab, Beijing, China
Xiangzheng Zhang
360
AI safety, Large language models, Information Retrieval
Hao Peng
Beihang University, Professor
Social Event Detection, Anomaly Detection, Reinforcement Learning