🤖 AI Summary
This work exposes a critical security vulnerability in large language model (LLM)-based intelligent agents: susceptibility to multi-step, long-horizon backdoor attacks in realistic, interactive environments. To this end, we propose Chain-of-Trigger (CoTri), the first backdoor framework enabling dynamic, environment-aware multi-step control. CoTri models environmental stochasticity to design an ordered sequence of triggers, leveraging agent feedback to progressively steer behavior away from the intended task. Counterintuitively, the backdoor simultaneously enhances the agent’s robustness and performance on clean tasks—thereby significantly improving stealth. Extensive experiments demonstrate that CoTri achieves near-perfect attack success rates (~100%) with negligible false-positive triggering across both LLM-based and vision-language model agents. Moreover, it exhibits strong cross-modal generalizability. CoTri establishes a new paradigm for agent security evaluation and provides a potent, realistic benchmark for assessing backdoor resilience in autonomous agents.
📝 Abstract
The rapid deployment of large language model (LLM)-based agents in real-world applications has raised serious concerns about their trustworthiness. In this work, we reveal the security and robustness vulnerabilities of these agents through backdoor attacks. Distinct from traditional backdoors limited to single-step control, we propose the Chain-of-Trigger Backdoor (CoTri), a multi-step backdoor attack designed for long-horizon agentic control. CoTri relies on an ordered sequence. It starts with an initial trigger, and subsequent ones are drawn from the environment, allowing multi-step manipulation that diverts the agent from its intended task. Experimental results show that CoTri achieves a near-perfect attack success rate (ASR) while maintaining a near-zero false trigger rate (FTR). Due to training data modeling the stochastic nature of the environment, the implantation of CoTri paradoxically enhances the agent's performance on benign tasks and even improves its robustness against environmental distractions. We further validate CoTri on vision-language models (VLMs), confirming its scalability to multimodal agents. Our work highlights that CoTri achieves stable, multi-step control within agents, improving their inherent robustness and task capabilities, which ultimately makes the attack more stealthy and raises potential safty risks.