🤖 AI Summary
This work reveals a fundamental "defense trilemma" facing current input-preprocessing defenses against prompt injection attacks: continuity, utility preservation, and perfect security cannot all be achieved simultaneously. Through formal analysis, the paper proves that over a connected prompt space, no defense wrapper that is both continuous and utility-preserving can guarantee absolute security. Using tools from topology, Lipschitz regularity, transversality conditions, and discrete mathematics, the framework is extended to multi-turn interactions and randomized defenses, with key results mechanically verified in Lean 4. Empirical evaluations on three large language models corroborate the theoretical predictions and, for the first time, precisely characterize the boundary conditions under which such defenses must fail, establishing a rigorous theoretical foundation for secure alignment.
📝 Abstract
We prove that no continuous, utility-preserving wrapper defense (a function $D: X \to X$ that preprocesses inputs before the model sees them) can make all outputs strictly safe for a language model with connected prompt space, and we characterize exactly where every such defense must fail. We establish three results under successively stronger hypotheses: boundary fixation (the defense must leave some threshold-level inputs unchanged); an $\varepsilon$-robust constraint (under Lipschitz regularity, a positive-measure band around fixed boundary points remains near-threshold); and a persistent unsafe region (under a transversality condition, a positive-measure subset of inputs remains strictly unsafe). These constitute a defense trilemma: continuity, utility preservation, and completeness cannot coexist. We prove parallel discrete results requiring no topology, and extend the theory to multi-turn interactions, stochastic defenses, and capacity-parity settings. The results do not preclude training-time alignment, architectural changes, or defenses that sacrifice utility. The full theory is mechanically verified in Lean 4 and validated empirically on three LLMs.
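The core obstruction is an intermediate-value argument: on a connected prompt space, a continuous wrapper composed with a continuous harm score must pass through the safety threshold somewhere. The sketch below is a minimal, hypothetical illustration, not the paper's construction: it assumes a 1-D prompt space $[0,1]$, an illustrative harm score `harm`, a threshold `TAU`, and a hand-written wrapper `defense` that is the identity on clearly safe inputs (utility preservation) and continuously damps the rest. Bisection then locates an input the defense necessarily leaves exactly at the threshold.

```python
TAU = 0.5  # hypothetical harm threshold

def harm(x: float) -> float:
    """Continuous harm score on the 1-D prompt space [0, 1]:
    harmless at 0, maximally harmful at 1."""
    return x

def defense(x: float) -> float:
    """A continuous wrapper: identity on clearly safe inputs
    (x <= 0.3, preserving utility there) and a continuous
    contraction toward the safe region elsewhere."""
    if x <= 0.3:
        return x
    return 0.3 + 0.5 * (x - 0.3)

def wrapped_harm(x: float) -> float:
    return harm(defense(x))

# wrapped_harm is continuous with wrapped_harm(0) = 0 < TAU and
# wrapped_harm(1) = 0.65 > TAU, so by the intermediate value theorem
# some x* satisfies wrapped_harm(x*) = TAU: a threshold-level input
# the defense cannot push strictly into the safe region.
lo, hi = 0.0, 1.0
for _ in range(60):  # bisection to locate the boundary point
    mid = (lo + hi) / 2
    if wrapped_harm(mid) < TAU:
        lo = mid
    else:
        hi = mid
x_star = (lo + hi) / 2
print(x_star, wrapped_harm(x_star))  # x* ≈ 0.7, wrapped harm ≈ TAU
```

The only escape routes are exactly the trilemma's other horns: make `defense` discontinuous, or let it rewrite safe inputs too (sacrificing utility).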