STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents

📅 2025-09-29

📈 Citations: 0

✨ Influential: 0

career value

224K/year

🤖 AI Summary

This paper identifies a novel security threat—Sequential Tool Attack Chains (STAC)—targeting embodied LLM agents with tool-calling capabilities. STACs exploit multi-turn, ostensibly benign tool invocations to implicitly trigger harmful behavior in the final execution step, evading conventional content-level safety detectors. Method: The authors introduce the first systematic STAC attack framework, featuring closed-loop automated attack generation, in-environment execution validation, and reverse-engineered stealthy dialogue sequence synthesis. It comprehensively covers ten failure modes across diverse agent architectures. Contribution/Results: Evaluated on state-of-the-art models including GPT-4.1, the framework achieves >90% attack success rate. A novel defense mechanism—monitoring reasoning traces during tool invocation sequences—reduces success rates by 28.8%, marking the first work to both expose and mitigate deep jailbreak risks inherent in long-horizon tool-using agent behaviors.

Technology Category

Application Category

📝 Abstract

As LLMs advance into autonomous agents with tool-use capabilities, they introduce security challenges that extend beyond traditional content-based LLM safety concerns. This paper introduces Sequential Tool Attack Chaining (STAC), a novel multi-turn attack framework that exploits agent tool use. STAC chains together tool calls that each appear harmless in isolation but, when combined, collectively enable harmful operations that only become apparent at the final execution step. We apply our framework to automatically generate and systematically evaluate 483 STAC cases, featuring 1,352 sets of user-agent-environment interactions and spanning diverse domains, tasks, agent types, and 10 failure modes. Our evaluations show that state-of-the-art LLM agents, including GPT-4.1, are highly vulnerable to STAC, with attack success rates (ASR) exceeding 90% in most cases. The core design of STAC's automated framework is a closed-loop pipeline that synthesizes executable multi-step tool chains, validates them through in-environment execution, and reverse-engineers stealthy multi-turn prompts that reliably induce agents to execute the verified malicious sequence. We further perform defense analysis against STAC and find that existing prompt-based defenses provide limited protection. To address this gap, we propose a new reasoning-driven defense prompt that achieves far stronger protection, cutting ASR by up to 28.8%. These results highlight a crucial gap: defending tool-enabled agents requires reasoning over entire action sequences and their cumulative effects, rather than evaluating isolated prompts or responses.

Problem

Research questions and friction points this paper is trying to address.

STAC exploits tool chaining to bypass LLM agent safety mechanisms

It combines seemingly harmless tools to enable harmful final operations

The framework demonstrates high vulnerability rates in state-of-the-art agents

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequential tool chains exploit isolated harmless calls

Automated pipeline synthesizes and validates multi-step attacks

Reasoning-driven defense reduces attack success rates significantly

🔎 Similar Papers

The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models