AI Summary
This work addresses the challenge that large language model agents often suffer degraded utility on benign tasks due to overly conservative safety mechanisms, while still needing robust defenses against malicious attacks. To resolve this trade-off, the authors propose a training-free, plug-and-play safety guardrail framework that integrates context-aware dynamic defense rules with a localized hierarchical memory system. A key innovation is an information entropy-based self-evolution mechanism that dynamically splits and merges memory nodes to precisely delineate safety decision boundaries. Experimental results on GPT-4o demonstrate that the approach maintains a rejection rate exceeding 93% on harmful requests while boosting utility on ambiguous benign tasks to 63.6%, significantly outperforming existing methods.
Abstract
With the rapid evolution of foundation models, Large Language Model (LLM) agents have demonstrated increasingly powerful tool-use capabilities. However, this proficiency introduces significant security risks, as malicious actors can manipulate agents into executing tools to generate harmful content. While existing defensive mechanisms are effective, they frequently suffer from the over-refusal problem, where increased safety strictness compromises the agent's utility on benign tasks. To mitigate this trade-off, we propose SafeHarbor, a novel framework designed to establish precise decision boundaries for LLM agents. Unlike static guidelines, SafeHarbor extracts context-aware defense rules through enhanced adversarial generation. We design a local hierarchical memory system for dynamic rule injection, offering a training-free, efficient, and plug-and-play solution. Furthermore, we introduce an information entropy-based self-evolution mechanism that continuously optimizes the memory structure through dynamic node splitting and merging. Extensive experiments demonstrate that SafeHarbor achieves state-of-the-art performance on both ambiguous benign tasks and explicit malicious attacks, notably attaining a peak benign utility of 63.6% on GPT-4o while maintaining a robust refusal rate exceeding 93% against harmful requests. The source code is publicly available at https://github.com/ljj-cyber/SafeHarbor.
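The abstract does not give the splitting or merging criteria, so the following is only a rough sketch of how an entropy-based rule of this kind could look, not the SafeHarbor implementation. It assumes each memory node stores a list of "benign"/"harmful" labels for the cases routed to it. The function names and both thresholds are hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of the label distribution stored in one node."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # H = sum over labels of p * log2(1 / p)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def should_split(labels, threshold=0.9):
    """Hypothetical rule: a node mixing benign and harmful cases is too coarse."""
    return label_entropy(labels) > threshold


def should_merge(labels_a, labels_b, threshold=0.3):
    """Hypothetical rule: merge two nodes if their union is still nearly pure."""
    return label_entropy(list(labels_a) + list(labels_b)) < threshold


# Assumed toy data: one mixed node and two nearly pure sibling nodes.
mixed_node = ["benign", "harmful", "benign", "harmful"]
pure_a = ["harmful"] * 3
pure_b = ["harmful"] * 2

split_decision = should_split(mixed_node)
merge_decision = should_merge(pure_a, pure_b)
```

Under these assumptions, a node whose cases are evenly mixed has entropy near 1 bit and would be split into finer nodes, while two nodes that agree on their label have low joint entropy and could be merged. That is one plausible reading of how splitting and merging might sharpen a decision boundary.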