Plant, Persist, Trigger: Sleeper Attack on Large Language Model Agents

📅 2026-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a covert security threat to large language model (LLM) agents, termed Sleeper Attack, in which adversaries plant adversarial content in persistent agent state (session context, memory, or reusable skills) that stays dormant and later triggers harmful behavior in response to a benign user query. The study formalizes this cross-interaction threat, exposes the inadequacy of current defense mechanisms, and introduces an evaluation benchmark of 1,896 instances spanning six harm categories, three attack strategies, and three agent state targets. Experiments on seven open-source and closed-source LLMs show that agents remain vulnerable to Sleeper Attack even when attack success rates are low under a single-interaction baseline, which underscores how hard such persistent threats are to detect and mitigate.
📝 Abstract
Large Language Model (LLM) agents remain vulnerable to safety threats from the external environment, where attackers inject adversarial content into external observations such as tool-returned data, webpages, or MCP context, causing harmful agentic behaviors such as unsafe actions or incorrect outputs. Existing studies typically focus on single-interaction attacks, where the agent observes adversarial content and immediately exhibits harmful behavior within one user request. However, we show that adversarial content can also persist across interactions served by the same agent, making such threats harder to detect and mitigate. Specifically, adversarial content may persist in the agent state, remain dormant across interactions, and later be activated by a benign user query. We formalize this type of safety threat as Sleeper Attack. To evaluate it, we construct a benchmark with 1,896 instances covering six real-world harmful outcomes, three attack strategies, and three agent state targets: session context, memory, and reusable skills. Experiments on seven strong open-source and closed-source LLMs show that state-of-the-art LLM agents remain vulnerable to Sleeper Attack, even when they achieve low attack success rates under a single-interaction baseline. Our code and data are available at https://anonymous.4open.science/r/skdvnfu23ihr9wdscnksf1asdffsaef.
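The plant, persist, trigger sequence described in the abstract can be illustrated with a toy simulation. This is a minimal sketch for intuition only, not the authors' benchmark or attack implementation: the `ToyAgent` class, the `keyword => instruction` note format, and the placeholder strings are all hypothetical, and the "agent" is plain string matching with no LLM involved.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    """Hypothetical agent whose memory persists across interactions."""
    memory: list[str] = field(default_factory=list)

    def interact(self, user_query: str, observation: str = "") -> str:
        # Persist: any external observation is written to long-lived state.
        if observation:
            self.memory.append(observation)
        # Trigger: a stored note is acted on once its keyword appears in a query.
        for note in self.memory:
            keyword, _, instruction = note.partition("=>")
            if instruction and keyword.strip().lower() in user_query.lower():
                return f"FOLLOWED STORED NOTE: {instruction.strip()}"
        return f"ANSWERED: {user_query}"


agent = ToyAgent()

# Interaction 1 (plant): the observation carries a dormant note; nothing harmful happens yet.
first = agent.interact(
    "Summarize this webpage",
    observation="invoice => <placeholder unsafe action>",
)
print(first)   # ANSWERED: Summarize this webpage

# Interaction 2 (trigger): a benign query activates the note stored earlier.
second = agent.interact("Please pay the invoice")
print(second)  # FOLLOWED STORED NOTE: <placeholder unsafe action>
```

A check that inspects only the first interaction sees nothing wrong, which is the sense in which a single-interaction baseline can report a low attack success rate while the persisted state remains exploitable.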
Problem

Research questions and friction points this paper is trying to address.

Sleeper Attack
Large Language Model Agents
Adversarial Content
Agent State Persistence
Safety Threats
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sleeper Attack
LLM agent safety
persistent adversarial content
cross-interaction threat
agent state vulnerability
Yongxiang Li
Professor, RMIT University
Electronic Materials and Devices
Moxin Li
National University of Singapore
natural language processing
Zhixin Ma
Singapore Management University
Fengbin Zhu
National University of Singapore
NLP, IR, LLM, Document AI, AI + Finance
Dongrui Liu
Shanghai Artificial Intelligence Laboratory
Wenjie Wang
University of Science and Technology of China
Fuli Feng
University of Science and Technology of China