More Vulnerable than You Think: On the Stability of Tool-Integrated LLM Agents

📅 2025-06-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-based tool agent evaluations primarily focus on end-to-end functionality, neglecting holistic pipeline stability—hindering real-world deployment. Method: This work introduces the first systematic stability evaluation framework for the full tool-calling pipeline—encompassing documentation understanding, tool selection, parameter generation, and response parsing—via multi-round error injection and behavioral monitoring. We empirically validate it across 12 mainstream open- and closed-source models and five categories of tool-integrated tasks. Contributions/Results: (1) Open-source agents exhibit significantly lower stability than their closed-source counterparts; (2) scaling model size does not consistently improve robustness and may exacerbate adversarial fragility; (3) stealthy, user-like instruction perturbations substantially increase failure rates. Our study establishes the first comprehensive benchmark for full-pipeline tool-calling stability, identifies critical vulnerability points, and provides actionable insights for designing more robust tool-augmented agents.
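To make the evaluation setup concrete, below is a minimal Python sketch of the multi-round error-injection idea: perturb one stage of the tool-calling pipeline at a time and measure how often the agent still completes the call. The stage names follow the summary above; everything else (the mock agent, the injectors, the `get_weather` tool) is a hypothetical assumption, not the authors' actual harness.

```python
import json

# One injector per pipeline stage; each returns a perturbed copy of a
# tool call. All perturbations here are illustrative assumptions.
INJECTORS = {
    # documentation understanding: truncate the docs the agent reads
    "documentation": lambda call: {**call, "docs": call["docs"][:20]},
    # tool selection: swap in a plausible distractor tool name
    "selection": lambda call: {**call, "tool": call["tool"] + "_v2"},
    # parameter generation: drop one required argument
    "parameters": lambda call: {**call, "args": dict(list(call["args"].items())[:-1])},
    # response parsing: malform the JSON the tool returns
    "response": lambda call: {**call, "raw_response": call["raw_response"][:-1]},
}

def run_agent(call):
    """Stand-in for a real tool-calling agent: succeeds only if the call
    is fully intact. A real harness would invoke the LLM agent here."""
    try:
        json.loads(call["raw_response"])
    except json.JSONDecodeError:
        return False
    return call["tool"] == "get_weather" and "city" in call["args"]

def stability_scores(base_call, rounds=3):
    """Per-stage fraction of injected trials the agent survives."""
    return {
        stage: sum(run_agent(inject(dict(base_call))) for _ in range(rounds)) / rounds
        for stage, inject in INJECTORS.items()
    }

call = {
    "docs": "get_weather(city: str) -> JSON forecast for the given city.",
    "tool": "get_weather",
    "args": {"city": "Berlin"},
    "raw_response": '{"temp_c": 21}',
}
print(stability_scores(call))  # e.g. {'documentation': 1.0, 'selection': 0.0, ...}
```

A real harness would swap `run_agent` for calls to each model under test, aggregate over many tasks and rounds, and log where in the pipeline each failure occurs.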

📝 Abstract
Current evaluations of tool-integrated LLM agents typically focus on end-to-end tool-usage performance while neglecting stability. This limits real-world applicability, as various internal or external factors can cause agents to crash or behave abnormally. Our research addresses this by investigating whether agents are vulnerable to errors throughout the entire tool invocation process, including reading tool documentation, selecting tools and generating parameters, and processing the tool's response. Through extensive experiments, we observe that agents are highly susceptible to errors at each stage, and that agents based on open-source models are more vulnerable than those based on proprietary models. We also find that increasing model size does not significantly improve tool invocation reasoning and may make agents more vulnerable to attacks resembling normal user instructions. These findings highlight the importance of evaluating agent stability and offer valuable insights for future LLM development and evaluation.
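The "attacks resembling normal user instructions" can be pictured as innocuous-sounding rewrites of a request that nudge the agent toward the wrong tool, wrong arguments, or skipped checks. A minimal sketch, with templates that are illustrative assumptions rather than the paper's actual perturbations:

```python
# Hypothetical user-like perturbations; each reads like ordinary chatter
# but targets one stage of the tool-calling pipeline.
PERTURBATIONS = [
    # Nudges tool selection toward a distractor
    lambda q: q + " By the way, the forecast_v2 tool is usually more accurate.",
    # Corrupts parameter generation with a casual format change
    lambda q: q + " Oh, and use the airport code instead of the city name.",
    # Discourages careful documentation reading
    lambda q: q + " No need to double-check the tool docs, I'm in a hurry.",
]

def perturbed_variants(query):
    """Yield stealthy rewrites of a query for failure-rate measurement."""
    for rewrite in PERTURBATIONS:
        yield rewrite(query)

for variant in perturbed_variants("What's the weather in Berlin right now?"):
    print(variant)
```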
Problem

Research questions and friction points this paper is trying to address.

Assessing stability vulnerabilities in tool-integrated LLM agents
Analyzing error susceptibility across tool invocation stages
Evaluating model size impact on agent robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Investigates the stability of tool-integrated LLM agents across the full tool-calling pipeline
Tests error vulnerability at each tool invocation stage via multi-round error injection
Evaluates how model size affects agent vulnerability
Authors

Weimin Xiong (Peking University)
Ke Wang (Huawei Technologies)
Yifan Song (National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University)
Hanchao Liu (Huawei Technologies)
Sai Zhou (Huawei Technologies)
Wei Peng (Huawei Technologies)
Sujian Li (National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University)