🤖 AI Summary
This work is the first to identify and systematically characterize a novel data poisoning threat targeting the feedback phase of LLM prompt optimizers, revealing the optimization process itself as an emerging attack surface. We propose a lightweight fake-reward attack (FRA) that requires no access to the reward model, and design a context-highlighting defense mechanism. Under the HarmBench framework, iterative feedback-optimization experiments show that feedback-based attacks raise attack success rate by up to ΔASR = 0.48, and that our defense reduces FRA's ΔASR from 0.23 to 0.07, a substantial robustness improvement with no loss in original optimization performance. Our core contributions are threefold: (1) formalizing the feedback-poisoning threat model in prompt optimization; (2) introducing a reward-model-agnostic attack paradigm; and (3) delivering an efficient, performance-preserving defense for enhanced robustness.
📝 Abstract
Large language model (LLM) systems now underpin everyday AI applications such as chatbots, computer-use assistants, and autonomous robots, where performance often depends on carefully designed prompts. LLM-based prompt optimizers reduce that effort by iteratively refining prompts from scored feedback, yet the security of this optimization stage remains underexamined. We present the first systematic analysis of poisoning risks in LLM-based prompt optimization. Using HarmBench, we find that systems are substantially more vulnerable to manipulated feedback than to injected queries: feedback-based attacks raise attack success rate (ASR) by up to ΔASR = 0.48. We introduce a simple fake-reward attack that requires no access to the reward model and significantly increases vulnerability, and we propose a lightweight highlighting defense that reduces the fake-reward ΔASR from 0.23 to 0.07 without degrading utility. These results establish prompt optimization pipelines as a first-class attack surface and motivate stronger safeguards for feedback channels and optimization frameworks.
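To illustrate the threat setting, the loop below is a minimal toy sketch of feedback poisoning in a greedy prompt-optimization step. All names (`true_reward`, `fake_reward`, the `"ATTACKER"` marker) are illustrative assumptions, not the paper's implementation: the point is only that an attacker who controls the feedback channel, with no access to the real reward model, can steer which prompt the optimizer keeps.

```python
# Hypothetical sketch of feedback poisoning in prompt optimization.
# Names and scoring logic are illustrative, not from the paper.

def true_reward(prompt: str) -> float:
    """Stand-in for the real scorer: benign prompts score well."""
    return 0.1 if "ATTACKER" in prompt else 0.8

def fake_reward(prompt: str) -> float:
    """Poisoned feedback channel: without querying the real reward
    model, it simply reports a perfect score for the attacker's prompt."""
    return 1.0 if "ATTACKER" in prompt else true_reward(prompt)

def optimize_step(candidates, feedback):
    """One greedy optimization step: keep the best-scored candidate."""
    return max(candidates, key=feedback)

benign = ["summarize politely", "answer concisely"]
poisoned = benign + ["ATTACKER prompt"]

print(optimize_step(benign, true_reward))    # honest feedback keeps a benign prompt
print(optimize_step(poisoned, fake_reward))  # poisoned feedback promotes the attacker's prompt
```

Run over many iterations, this is why manipulating the feedback channel can be more damaging than injecting queries: the optimizer itself amplifies the poisoned signal.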