🤖 AI Summary
This work reveals a fundamental "defense trilemma" facing current input-preprocessing defenses against prompt injection attacks: continuity, utility preservation, and perfect security cannot all be achieved simultaneously. Through formal analysis, the paper proves that over a connected prompt space, no defense wrapper that is both continuous and utility-preserving can guarantee absolute security. Using tools from topology, Lipschitz regularity, transversality conditions, and discrete mathematics, the framework is extended to multi-turn interactions and randomized defenses, with key results mechanically verified in Lean 4. Empirical evaluations on three large language models corroborate the theoretical predictions and, for the first time, precisely characterize the boundary conditions under which such defenses must fail, establishing a rigorous theoretical foundation for secure alignment.
📝 Abstract
We prove that no continuous, utility-preserving wrapper defense (a function $D: X \to X$ that preprocesses inputs before the model sees them) can make all outputs strictly safe for a language model with connected prompt space, and we characterize exactly where every such defense must fail. We establish three results under successively stronger hypotheses: boundary fixation (the defense must leave some threshold-level inputs unchanged); an $\varepsilon$-robust constraint (under Lipschitz regularity, a positive-measure band around fixed boundary points remains near-threshold); and a persistent unsafe region (under a transversality condition, a positive-measure subset of inputs remains strictly unsafe). These constitute a defense trilemma: continuity, utility preservation, and completeness cannot coexist. We prove parallel discrete results requiring no topology, and extend the theory to multi-turn interactions, stochastic defenses, and capacity-parity settings. The results do not preclude training-time alignment, architectural changes, or defenses that sacrifice utility. The full theory is mechanically verified in Lean 4 and validated empirically on three LLMs.
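The core obstruction is an intermediate-value argument: on a connected prompt space, a continuous wrapper composed with a continuous harm score must pass through the safety threshold somewhere. The sketch below is a minimal, hypothetical illustration, not the paper's construction: it assumes a 1-D prompt space $[0,1]$, an illustrative harm score `harm`, a threshold `TAU`, and a hand-written wrapper `defense` that is the identity on clearly safe inputs (utility preservation) and continuously damps the rest. Bisection then locates an input the defense necessarily leaves exactly at the threshold.

```python
TAU = 0.5  # hypothetical harm threshold

def harm(x: float) -> float:
    """Continuous harm score on the 1-D prompt space [0, 1]:
    harmless at 0, maximally harmful at 1."""
    return x

def defense(x: float) -> float:
    """A continuous wrapper: identity on clearly safe inputs
    (x <= 0.3, preserving utility there) and a continuous
    contraction toward the safe region elsewhere."""
    if x <= 0.3:
        return x
    return 0.3 + 0.5 * (x - 0.3)

def wrapped_harm(x: float) -> float:
    return harm(defense(x))

# wrapped_harm is continuous with wrapped_harm(0) = 0 < TAU and
# wrapped_harm(1) = 0.65 > TAU, so by the intermediate value theorem
# some x* satisfies wrapped_harm(x*) = TAU: a threshold-level input
# the defense cannot push strictly into the safe region.
lo, hi = 0.0, 1.0
for _ in range(60):  # bisection to locate the boundary point
    mid = (lo + hi) / 2
    if wrapped_harm(mid) < TAU:
        lo = mid
    else:
        hi = mid
x_star = (lo + hi) / 2
print(x_star, wrapped_harm(x_star))  # x* ≈ 0.7, wrapped harm ≈ TAU
```

The only escape routes are exactly the trilemma's other horns: make `defense` discontinuous, or let it rewrite safe inputs too (sacrificing utility).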