Chain-of-Trigger: An Agentic Backdoor that Paradoxically Enhances Agentic Robustness

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work exposes a critical security vulnerability in large language model (LLM)-based intelligent agents: susceptibility to multi-step, long-horizon backdoor attacks in realistic, interactive environments. To this end, we propose Chain-of-Trigger (CoTri), the first backdoor framework enabling dynamic, environment-aware multi-step control. CoTri models environmental stochasticity to design an ordered sequence of triggers, leveraging agent feedback to progressively steer behavior away from the intended task. Counterintuitively, the backdoor simultaneously enhances the agent’s robustness and performance on clean tasks—thereby significantly improving stealth. Extensive experiments demonstrate that CoTri achieves near-perfect attack success rates (~100%) with negligible false-positive triggering across both LLM-based and vision-language model agents. Moreover, it exhibits strong cross-modal generalizability. CoTri establishes a new paradigm for agent security evaluation and provides a potent, realistic benchmark for assessing backdoor resilience in autonomous agents.

📝 Abstract
The rapid deployment of large language model (LLM)-based agents in real-world applications has raised serious concerns about their trustworthiness. In this work, we reveal the security and robustness vulnerabilities of these agents through backdoor attacks. Distinct from traditional backdoors limited to single-step control, we propose the Chain-of-Trigger Backdoor (CoTri), a multi-step backdoor attack designed for long-horizon agentic control. CoTri relies on an ordered trigger sequence: an initial trigger activates the attack, and subsequent triggers are drawn from the environment, enabling multi-step manipulation that diverts the agent from its intended task. Experimental results show that CoTri achieves a near-perfect attack success rate (ASR) while maintaining a near-zero false trigger rate (FTR). Because the training data models the stochastic nature of the environment, implanting CoTri paradoxically enhances the agent's performance on benign tasks and even improves its robustness against environmental distractions. We further validate CoTri on vision-language models (VLMs), confirming its scalability to multimodal agents. Our work highlights that CoTri achieves stable, multi-step control within agents while improving their inherent robustness and task capabilities, which ultimately makes the attack stealthier and raises serious safety risks.
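The ordered trigger sequence described above can be illustrated with a minimal sketch: an initial trigger arms the chain, and each later trigger must appear in environment feedback, in order, before the backdoor behavior fully activates. All names below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a chain-of-trigger matcher: the backdoor only
# activates once an ordered sequence of triggers has been observed,
# which is what keeps the false trigger rate near zero.
class ChainOfTrigger:
    def __init__(self, triggers):
        self.triggers = triggers  # ordered trigger strings (attacker-chosen)
        self.stage = 0            # index of the next expected trigger

    def observe(self, observation: str) -> bool:
        """Advance the chain when the next expected trigger appears in the
        observation; return True once every trigger has fired in order."""
        if self.stage < len(self.triggers) and self.triggers[self.stage] in observation:
            self.stage += 1
        return self.stage == len(self.triggers)

chain = ChainOfTrigger(["T0", "T1", "T2"])
assert not chain.observe("benign text mentioning T1")   # out-of-order: ignored
assert not chain.observe("user prompt containing T0")   # chain armed
assert not chain.observe("env feedback containing T1")  # chain advances
assert chain.observe("env feedback containing T2")      # chain complete
```

The requirement that triggers arrive in a fixed order, with later ones drawn from environment feedback rather than the user prompt, is what distinguishes this from single-step backdoors: benign inputs containing a lone trigger string do not activate the attack.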
Problem

Research questions and friction points this paper is trying to address.

Proposes multi-step backdoor attack for long-horizon agentic control
Reveals security vulnerabilities in LLM-based agents through backdoors
Achieves high attack success while enhancing benign task performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-step backdoor attack for agentic control
Ordered trigger sequence from environment interactions
Paradoxically enhances agent robustness and performance
Jiyang Qiu (School of Computer Science, Shanghai Jiao Tong University)
Xinbei Ma (Shanghai Jiao Tong University)
Yunqing Xu (School of Computer Science, Shanghai Jiao Tong University)
Zhuosheng Zhang (Assistant Professor, Shanghai Jiao Tong University)
Natural Language Processing · Large Language Models · Reasoning · AI Safety · Multi-Agent Learning
Hai Zhao (School of Computer Science, Shanghai Jiao Tong University)