🤖 AI Summary
This work is the first to identify and systematically characterize a novel data poisoning threat targeting the feedback phase of LLM prompt optimizers, revealing the optimization process itself as an emerging attack surface. We propose a lightweight fake-reward attack (FRA) that requires no access to the reward model, and design a context-highlighting defense mechanism. Under the HarmBench framework, iterative feedback-optimization experiments show that feedback-based attacks raise attack success rate by up to ΔASR = 0.48, and that our defense reduces FRA's ΔASR from 0.23 to 0.07, a substantial robustness improvement with no loss in original optimization performance. Our core contributions are threefold: (1) formalizing the feedback-poisoning threat model in prompt optimization; (2) introducing a reward-model-agnostic attack paradigm; and (3) delivering an efficient, performance-preserving defense for enhanced robustness.
📝 Abstract
Large language model (LLM) systems now underpin everyday AI applications such as chatbots, computer-use assistants, and autonomous robots, where performance often depends on carefully designed prompts. LLM-based prompt optimizers reduce that effort by iteratively refining prompts from scored feedback, yet the security of this optimization stage remains underexamined. We present the first systematic analysis of poisoning risks in LLM-based prompt optimization. Using HarmBench, we find that systems are substantially more vulnerable to manipulated feedback than to injected queries: feedback-based attacks raise attack success rate (ASR) by up to ΔASR = 0.48. We introduce a simple fake-reward attack that requires no access to the reward model and significantly increases vulnerability, and we propose a lightweight highlighting defense that reduces the fake-reward ΔASR from 0.23 to 0.07 without degrading utility. These results establish prompt optimization pipelines as a first-class attack surface and motivate stronger safeguards for feedback channels and optimization frameworks.
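To illustrate the threat setting, the loop below is a minimal toy sketch of feedback poisoning in a greedy prompt-optimization step. All names (`true_reward`, `fake_reward`, the `"ATTACKER"` marker) are illustrative assumptions, not the paper's implementation: the point is only that an attacker who controls the feedback channel, with no access to the real reward model, can steer which prompt the optimizer keeps.

```python
# Hypothetical sketch of feedback poisoning in prompt optimization.
# Names and scoring logic are illustrative, not from the paper.

def true_reward(prompt: str) -> float:
    """Stand-in for the real scorer: benign prompts score well."""
    return 0.1 if "ATTACKER" in prompt else 0.8

def fake_reward(prompt: str) -> float:
    """Poisoned feedback channel: without querying the real reward
    model, it simply reports a perfect score for the attacker's prompt."""
    return 1.0 if "ATTACKER" in prompt else true_reward(prompt)

def optimize_step(candidates, feedback):
    """One greedy optimization step: keep the best-scored candidate."""
    return max(candidates, key=feedback)

benign = ["summarize politely", "answer concisely"]
poisoned = benign + ["ATTACKER prompt"]

print(optimize_step(benign, true_reward))    # honest feedback keeps a benign prompt
print(optimize_step(poisoned, fake_reward))  # poisoned feedback promotes the attacker's prompt
```

Run over many iterations, this is why manipulating the feedback channel can be more damaging than injecting queries: the optimizer itself amplifies the poisoned signal.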