When Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized Agents

📅 2026-05-07

🤖 AI Summary
This work addresses the risk of unintended long-term state poisoning in personalized large language model agents, where mundane interactions can corrupt persistent internal state, leading to blurred authorization boundaries, tool misuse, and behavioral drift. The study formally defines this previously uncharacterized threat and introduces StateGuard, a lightweight defense framework that combines state-diff auditing with selective rollback. To evaluate such risks, the authors construct ULSPB, a bilingual benchmark comprising 350 scenarios, along with a quantitative Harm Score metric. Experimental results show that StateGuard reduces Harm Scores to near zero across four backbone models and lowers false-negative rates, at the cost of a high false-positive rate that the authors consider acceptable for a safety-first defense, and with minimal computational overhead.
📝 Abstract
Personalized LLM agents maintain persistent cross-session state to support long-horizon collaboration. Yet this persistence introduces a subtle but critical security vulnerability: routine user-agent interactions can gradually reshape an agent's long-term state, inadvertently weakening future confirmation boundaries, expanding tool-use defaults, and escalating autonomous behavior over time. We formalize this risk as unintended long-term state poisoning. To study it systematically, we introduce the Unintended Long-Term State Poisoning Bench (ULSPB), a bilingual benchmark comprising 350 settings spanning five assistance categories, seven interaction patterns, 24-turn routine interactions, and matched single-injection counterparts. Furthermore, we define the Harm Score (HS), a state-centric metric that quantifies authorization drift, tool-use escalation, and unchecked autonomy. Experiments on OpenClaw with four backbone LLMs demonstrate that, while single-injection attacks are generally effective, routine conversations alone can substantially poison long-term state, primarily corrupting memory-centric artifacts. Evaluations seeded with real-world user interactions confirm that this risk is not a mere artifact of synthetic prompts. To mitigate this threat, we propose StateGuard, a lightweight, post-execution defense that audits state diffs at the writeback boundary and selectively rolls back dangerous edits. Across all evaluated models, StateGuard reduces HS to near zero and lowers false-negative rates, at the cost of high but acceptable false-positive rates under a safety-first writeback defense, and with minimal overhead.
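The abstract describes StateGuard only at a high level: audit the state diff at the writeback boundary, then roll back dangerous edits while keeping the rest. The sketch below illustrates that general pattern under stated assumptions and is not the authors' implementation. The flat key-value state, the names `Edit`, `diff_state`, `is_dangerous`, and `guarded_writeback`, and the prefix-based rule in `SENSITIVE_PREFIXES` are all hypothetical; the paper's actual auditor and state format are not specified here.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Sentinel marking "key not present", so a stored None stays distinguishable.
MISSING = object()

# Hypothetical rule set: key prefixes assumed to govern confirmation
# boundaries and tool-use defaults. Not taken from the paper.
SENSITIVE_PREFIXES = ("memory.confirm", "tools.")


@dataclass(frozen=True)
class Edit:
    key: str
    old: Any  # value before the session, or MISSING
    new: Any  # value after the session, or MISSING


def diff_state(before: dict[str, Any], after: dict[str, Any]) -> list[Edit]:
    """Return one Edit per key whose value changed, in sorted key order."""
    edits = []
    for key in sorted(set(before) | set(after)):
        old = before.get(key, MISSING)
        new = after.get(key, MISSING)
        if old != new:
            edits.append(Edit(key, old, new))
    return edits


def is_dangerous(edit: Edit) -> bool:
    """Stand-in auditor: flag any edit touching a sensitive key prefix."""
    return edit.key.startswith(SENSITIVE_PREFIXES)


def guarded_writeback(
    before: dict[str, Any],
    after: dict[str, Any],
    auditor: Callable[[Edit], bool] = is_dangerous,
) -> tuple[dict[str, Any], list[str]]:
    """Commit `after`, but revert every edit the auditor flags."""
    committed = dict(after)
    rolled_back = []
    for edit in diff_state(before, after):
        if auditor(edit):
            if edit.old is MISSING:
                committed.pop(edit.key, None)  # key was newly added: drop it
            else:
                committed[edit.key] = edit.old  # restore the prior value
            rolled_back.append(edit.key)
    return committed, rolled_back


before = {"memory.confirm_before_send": True, "profile.tone": "formal"}
after = {
    "memory.confirm_before_send": False,  # weakened confirmation boundary
    "profile.tone": "casual",             # benign preference change
    "tools.auto_approve": ["email"],      # expanded tool-use default
}
committed, rolled_back = guarded_writeback(before, after)
```

In this toy run the benign tone change is committed, while the two edits that weaken confirmation or expand tool defaults are reverted. A coarse rule like this one reverts every edit to a sensitive key, including legitimate ones, which is one way a safety-first writeback check can end up with a high false-positive rate.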
Problem

Research questions and friction points this paper is trying to address.

unintended long-term state poisoning
personalized agents
persistent state
authorization drift
autonomous behavior
Innovation

Methods, ideas, or system contributions that make the work stand out.

unintended long-term state poisoning
personalized LLM agents
StateGuard
Harm Score
persistent state security