More Vulnerable than You Think: On the Stability of Tool-Integrated LLM Agents

📅 2025-06-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-based tool agent evaluations primarily focus on end-to-end functionality, neglecting holistic pipeline stability—hindering real-world deployment. Method: This work introduces the first systematic stability evaluation framework for the full tool-calling pipeline—encompassing documentation understanding, tool selection, parameter generation, and response parsing—via multi-round error injection and behavioral monitoring. We empirically validate it across 12 mainstream open- and closed-source models and five categories of tool-integrated tasks. Contributions/Results: (1) Open-source agents exhibit significantly lower stability than their closed-source counterparts; (2) scaling model size does not consistently improve robustness and may exacerbate adversarial fragility; (3) stealthy, user-like instruction perturbations substantially increase failure rates. Our study establishes the first comprehensive benchmark for full-pipeline tool-calling stability, identifies critical vulnerability points, and provides actionable insights for designing more robust tool-augmented agents.
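To make the evaluation setup concrete, below is a minimal Python sketch of the multi-round error-injection idea: perturb one stage of the tool-calling pipeline at a time and measure how often the agent still completes the call. The stage names follow the summary above; everything else (the mock agent, the injectors, the `get_weather` tool) is a hypothetical assumption, not the authors' actual harness.

```python
import json

# One injector per pipeline stage; each returns a perturbed copy of a
# tool call. All perturbations here are illustrative assumptions.
INJECTORS = {
    # documentation understanding: truncate the docs the agent reads
    "documentation": lambda call: {**call, "docs": call["docs"][:20]},
    # tool selection: swap in a plausible distractor tool name
    "selection": lambda call: {**call, "tool": call["tool"] + "_v2"},
    # parameter generation: drop one required argument
    "parameters": lambda call: {**call, "args": dict(list(call["args"].items())[:-1])},
    # response parsing: malform the JSON the tool returns
    "response": lambda call: {**call, "raw_response": call["raw_response"][:-1]},
}

def run_agent(call):
    """Stand-in for a real tool-calling agent: succeeds only if the call
    is fully intact. A real harness would invoke the LLM agent here."""
    try:
        json.loads(call["raw_response"])
    except json.JSONDecodeError:
        return False
    return call["tool"] == "get_weather" and "city" in call["args"]

def stability_scores(base_call, rounds=3):
    """Per-stage fraction of injected trials the agent survives."""
    return {
        stage: sum(run_agent(inject(dict(base_call))) for _ in range(rounds)) / rounds
        for stage, inject in INJECTORS.items()
    }

call = {
    "docs": "get_weather(city: str) -> JSON forecast for the given city.",
    "tool": "get_weather",
    "args": {"city": "Berlin"},
    "raw_response": '{"temp_c": 21}',
}
print(stability_scores(call))  # e.g. {'documentation': 1.0, 'selection': 0.0, ...}
```

A real harness would swap `run_agent` for calls to each model under test, aggregate over many tasks and rounds, and log where in the pipeline each failure occurs.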

📝 Abstract
Current evaluations of tool-integrated LLM agents typically focus on end-to-end tool-usage performance while neglecting stability. This limits real-world applicability, as various internal or external factors can cause agents to crash or behave abnormally. Our research addresses this by investigating whether agents are vulnerable to errors throughout the entire tool invocation process, including reading tool documentation, selecting tools and generating parameters, and processing the tool's response. Through extensive experiments, we observe that agents are highly susceptible to errors at each stage, and that agents based on open-source models are more vulnerable than those based on proprietary models. We also find that increasing model size does not significantly improve tool invocation reasoning and may make agents more vulnerable to attacks resembling normal user instructions. These findings highlight the importance of evaluating agent stability and offer valuable insights for future LLM development and evaluation.
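The "attacks resembling normal user instructions" can be pictured as innocuous-sounding rewrites of a request that nudge the agent toward the wrong tool, wrong arguments, or skipped checks. A minimal sketch, with templates that are illustrative assumptions rather than the paper's actual perturbations:

```python
# Hypothetical user-like perturbations; each reads like ordinary chatter
# but targets one stage of the tool-calling pipeline.
PERTURBATIONS = [
    # Nudges tool selection toward a distractor
    lambda q: q + " By the way, the forecast_v2 tool is usually more accurate.",
    # Corrupts parameter generation with a casual format change
    lambda q: q + " Oh, and use the airport code instead of the city name.",
    # Discourages careful documentation reading
    lambda q: q + " No need to double-check the tool docs, I'm in a hurry.",
]

def perturbed_variants(query):
    """Yield stealthy rewrites of a query for failure-rate measurement."""
    for rewrite in PERTURBATIONS:
        yield rewrite(query)

for variant in perturbed_variants("What's the weather in Berlin right now?"):
    print(variant)
```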
Problem

Research questions and friction points this paper is trying to address.

Assessing stability vulnerabilities in tool-integrated LLM agents
Analyzing error susceptibility across tool invocation stages
Evaluating model size impact on agent robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Investigates the stability of tool-integrated LLM agents across the full tool-calling pipeline
Tests error vulnerability at each tool invocation stage via multi-round error injection
Evaluates how model size affects agent vulnerability
Authors

Weimin Xiong (Peking University)
Ke Wang (Huawei Technologies)
Yifan Song (National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University)
Hanchao Liu (Huawei Technologies)
Sai Zhou (Huawei Technologies)
Wei Peng (Huawei Technologies)
Sujian Li (National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University)