ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

📅 2026-01-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the security risks inherent in tool invocation by large language model (LLM) agents, where existing approaches lack fine-grained, real-time monitoring and intervention mechanisms. To tackle this challenge, the authors introduce TS-Bench, a novel benchmark for evaluating tool-use safety, and propose TS-Guard, a multi-task reinforcement learning–based defense model that enables step-level safety detection and feedback during tool calls for the first time. Integrated within the TS-Flow framework, TS-Guard leverages interaction history to perform interpretable and generalizable safety reasoning, seamlessly embedding into the ReAct agent pipeline. Experimental results demonstrate that the approach reduces harmful tool invocations by an average of 65% under prompt injection attacks while simultaneously improving benign task completion rates by approximately 10%.

📝 Abstract
While LLM-based agents can interact with environments via invoking external tools, their expanded capabilities also amplify security risks. Monitoring step-level tool invocation behaviors in real time and proactively intervening before unsafe execution is critical for agent deployment, yet remains under-explored. In this work, we first construct TS-Bench, a novel benchmark for step-level tool invocation safety detection in LLM agents. We then develop a guardrail model, TS-Guard, using multi-task reinforcement learning. The model proactively detects unsafe tool invocation actions before execution by reasoning over the interaction history. It assesses request harmfulness and action-attack correlations, producing interpretable and generalizable safety judgments and feedback. Furthermore, we introduce TS-Flow, a guardrail-feedback-driven reasoning framework for LLM agents, which reduces harmful tool invocations of ReAct-style agents by 65 percent on average and improves benign task completion by approximately 10 percent under prompt injection attacks.
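The abstract describes the core mechanism: a guardrail that reasons over the interaction history and judges each proposed tool call *before* execution, feeding its verdict back to the agent as feedback. The sketch below illustrates that control flow under stated assumptions: all names (`ToolCall`, `StubGuard`, `step`) are illustrative, and the paper's TS-Guard, a trained multi-task RL model, is approximated here by a trivial rule-based stub.

```python
# Hypothetical sketch of a proactive step-level guardrail in a
# ReAct-style loop. TS-Guard itself is a learned model; StubGuard
# below is only a placeholder to show where the check sits.
from dataclasses import dataclass

@dataclass
class ToolCall:
    tool: str
    args: dict

@dataclass
class Verdict:
    safe: bool
    feedback: str

class StubGuard:
    """Stands in for TS-Guard: judges a proposed tool call against
    the interaction history *before* the tool is executed."""
    SENSITIVE_TOOLS = {"send_email", "delete_file"}

    def judge(self, history, call: ToolCall) -> Verdict:
        # Crude stand-in for action-attack correlation: flag a
        # sensitive tool requested after obviously injected text.
        if call.tool in self.SENSITIVE_TOOLS and any(
            "IGNORE PREVIOUS" in h for h in history
        ):
            return Verdict(False, f"'{call.tool}' correlates with an injected instruction")
        return Verdict(True, "no attack correlation detected")

def step(history, call: ToolCall, guard, execute):
    """One agent step with proactive intervention: an unsafe call is
    never executed; the guard's feedback becomes the observation the
    agent reasons over in its next step."""
    verdict = guard.judge(history, call)
    if verdict.safe:
        obs = execute(call)
    else:
        obs = f"[guardrail] blocked: {verdict.feedback}"
    history.append(obs)
    return obs
```

The design point this illustrates is that the guard intercepts the action between the agent's decision and the environment's execution, so harmful invocations are prevented rather than audited after the fact, while benign calls pass through unchanged.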
Problem

Research questions and friction points this paper is trying to address.

tool invocation safety
LLM-based agents
proactive intervention
step-level monitoring
security risks
Innovation

Methods, ideas, or system contributions that make the work stand out.

step-level guardrail
tool invocation safety
multi-task reinforcement learning
LLM-based agents
proactive safety intervention