How Not to Detect Prompt Injections with an LLM

๐Ÿ“… 2025-07-07
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work exposes a fundamental structural vulnerability in Known-Answer Detection (KAD), a mainstream defense against prompt injection attacks: KAD relies on fixed-answer patterns for input classification and is inherently incapable of detecting semantically consistent yet syntactically variant malicious instructions. To address this, we present the first formal model of KAD and propose DataFlipโ€”a black-box, adaptive attack that requires no white-box access, gradient information, or optimization. DataFlip systematically evades detection via semantic-preserving input rewriting and dynamic answer perturbation. Experiments demonstrate that DataFlip reduces KADโ€™s detection rate to 1.5% while achieving an 88% attack success rate. These results fundamentally challenge the security-validity assumption underlying high-accuracy KAD variants. Our work provides critical theoretical insights and practical warnings for the design and evaluation of LLM security defenses.

Technology Category

Application Category

๐Ÿ“ Abstract
LLM-integrated applications and agents are vulnerable to prompt injection attacks, in which adversaries embed malicious instructions within seemingly benign user inputs to manipulate the LLM's intended behavior. Recent defenses based on $ extit{known-answer detection}$ (KAD) have achieved near-perfect performance by using an LLM to classify inputs as clean or contaminated. In this work, we formally characterize the KAD framework and uncover a structural vulnerability in its design that invalidates its core security premise. We design a methodical adaptive attack, $ extit{DataFlip}$, to exploit this fundamental weakness. It consistently evades KAD defenses with detection rates as low as $1.5%$ while reliably inducing malicious behavior with success rates of up to $88%$, without needing white-box access to the LLM or any optimization procedures.
Problem

Research questions and friction points this paper is trying to address.

Detecting prompt injection attacks in LLM applications
Exposing structural vulnerability in known-answer detection defenses
Evading KAD defenses with high success rates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Exposes vulnerability in known-answer detection
Introduces DataFlip adaptive attack method
Achieves high attack success without optimization
๐Ÿ”Ž Similar Papers
No similar papers found.