AI Summary
Existing defenses struggle to mitigate prompt injection attacks that manipulate context while preserving the model's normal functionality. This work introduces, for the first time, the theory of Contextual Integrity (CI) from privacy research into the analysis of prompt injections, formally modeling how such attacks subvert AI behavior by distorting information flows, altering contextual norms, or blending multiple contextual sources to violate situational expectations. Building on this framework, we devise three novel attack strategies and demonstrate that current defenses cover only a limited portion of the attack surface, as adversaries can always construct seemingly legitimate contexts to bypass protections. Our findings reveal a fundamental intractability in fully preventing prompt injections and establish CI-based principles for evaluating and designing future aligned autonomous agents.
Abstract
Prompt injection is the most critical vulnerability in deployed AI agents. Despite recent progress, we show that the prevailing defense paradigm (data-instruction separation) both fails to detect attacks that operate through contextual manipulation and degrades contextually appropriate behavior. We then recast prompt injection through the lens of Contextual Integrity (CI), a privacy theory that judges whether information flows comply with contextual norms. This explains the types of attacks that current defenses attempt to patch and predicts the more advanced ones that future agents will face. We develop unique benign and attack scenarios that force an agent to violate the norms by (1) misrepresenting the flow, (2) manipulating norms, or (3) mixing multiple flows. This reframing suggests an impossibility result: either an adversary can always construct a context under which a blocked flow appears legitimate, or a defender who tightens norms will block genuinely legitimate flows. Our findings suggest that current research addresses a shrinking fraction of future attack surfaces. Instead, through CI, we offer a principled framework for evaluating context-sensitive failures and designing CI-aware alignment for frontier autonomous agents.
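As a rough illustration of the CI framing, the sketch below treats a norm as a (context, information type, recipient) triple and lets the agent judge a flow only by what its context claims. Everything in it is a hypothetical simplification introduced here, not the paper's formalism: the `Flow` fields, the `agent_allows` and `violates_ci` helpers, and the travel-booking scenario are all assumptions. It covers only strategy (1), misrepresenting the flow, along with the trade-off a defender faces when tightening norms.

```python
from dataclasses import dataclass

# Hypothetical toy model; names and fields are illustrative assumptions.
@dataclass(frozen=True)
class Flow:
    context: str
    info_type: str
    claimed_recipient: str  # what the agent's context says about the recipient
    true_recipient: str     # ground truth, which the agent cannot observe


def agent_allows(flow: Flow, norms: set) -> bool:
    """The agent can only check the flow as its context presents it."""
    return (flow.context, flow.info_type, flow.claimed_recipient) in norms


def violates_ci(flow: Flow, norms: set) -> bool:
    """A flow violates CI if the real flow matches no contextual norm."""
    return (flow.context, flow.info_type, flow.true_recipient) not in norms


# Assumed norm: passport numbers may go to the airline while booking travel.
norms = {("travel_booking", "passport_number", "airline")}

benign = Flow("travel_booking", "passport_number", "airline", "airline")
# Strategy (1): the injected content misrepresents who receives the data.
attack = Flow("travel_booking", "passport_number", "airline", "attacker")

# The misrepresented flow looks legitimate to the agent yet violates the norm.
attack_passes = agent_allows(attack, norms) and violates_ci(attack, norms)

# A defender who tightens norms to block it also blocks the benign flow.
strict_norms = set()
benign_blocked = not agent_allows(benign, strict_norms)
```

In this toy setting the agent cannot separate `benign` from `attack`, because both present the same claimed flow. Any norm set that admits the benign flow therefore admits the misrepresented one, which is the tension the abstract describes.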