Are My Optimized Prompts Compromised? Exploring Vulnerabilities of LLM-based Optimizers

📅 2025-10-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work首次 identifies and systematically characterizes a novel data poisoning threat targeting the feedback phase of LLM prompt optimizers—revealing the optimization process itself as an emerging attack surface. We propose a lightweight False Reward Attack (FRA) that requires no access to the reward model, and design a context-highlighting–based defense mechanism. Under the HarmBench framework, iterative feedback-optimization experiments demonstrate that FRA increases attack success rate by ΔASR = 0.48; our defense reduces FRA’s ΔASR from 0.23 to 0.07—achieving substantial robustness improvement without degrading original optimization performance. Our core contributions are threefold: (1) formalizing the feedback poisoning threat model in prompt optimization; (2) introducing a reward-model–agnostic attack paradigm; and (3) delivering an efficient, performance-preserving defense for enhanced robustness.

Technology Category

Application Category

📝 Abstract
Large language model (LLM) systems now underpin everyday AI applications such as chatbots, computer-use assistants, and autonomous robots, where performance often depends on carefully designed prompts. LLM-based prompt optimizers reduce that effort by iteratively refining prompts from scored feedback, yet the security of this optimization stage remains underexamined. We present the first systematic analysis of poisoning risks in LLM-based prompt optimization. Using HarmBench, we find systems are substantially more vulnerable to manipulated feedback than to injected queries: feedback-based attacks raise attack success rate (ASR) by up to $Δ$ASR = 0.48. We introduce a simple fake-reward attack that requires no access to the reward model and significantly increases vulnerability, and we propose a lightweight highlighting defense that reduces the fake-reward $Δ$ASR from 0.23 to 0.07 without degrading utility. These results establish prompt optimization pipelines as a first-class attack surface and motivate stronger safeguards for feedback channels and optimization frameworks.
Problem

Research questions and friction points this paper is trying to address.

Investigating poisoning vulnerabilities in LLM-based prompt optimization
Analyzing manipulated feedback risks in AI optimization pipelines
Proposing defenses against fake-reward attacks in prompt optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing poisoning risks in LLM-based prompt optimization
Introducing fake-reward attack without reward model access
Proposing highlighting defense to reduce attack success rate
🔎 Similar Papers
No similar papers found.
Andrew Zhao
Andrew Zhao
Tsinghua University
Reinforcement LearningLanguage AgentReasoning
R
Reshmi Ghosh
Microsoft
V
Vitor Carvalho
Microsoft
E
Emily Lawton
Microsoft
K
Keegan Hines
Microsoft
G
Gao Huang
Tsinghua University
Jack W. Stokes
Jack W. Stokes
Microsoft Research
machine learningsecurity