🤖 AI Summary
This paper addresses the reliability degradation of large language model (LLM) responses caused by prompt injection attacks. Methodologically, it proposes an injection-resistant dual-channel Transformer architecture: (1) a structured prompt isolation paradigm that separates trusted system instructions and untrusted user inputs into distinct channels; (2) a gated fusion mechanism coupled with a provably invariant system-instruction branch to preserve instruction integrity; and (3) integration of a Mixture-of-Experts (MoE) security expert module with cybersecurity knowledge graph (CKG)-guided dynamic reasoning for domain-aware defense. Experiments demonstrate a 99.2% defense success rate against representative attacks such as Policy Puppetry, zero-shot cross-domain transfer capability, and a complete deployment pipeline—from pretraining to efficient fine-tuning—ensuring practical applicability and robustness.
📝 Abstract
We propose a robust transformer architecture designed to prevent prompt injection attacks and ensure secure, reliable response generation. Our PICO (Prompt Isolation and Cybersecurity Oversight) framework structurally separates trusted system instructions from untrusted user inputs through dual channels that are processed independently and merged only by a controlled, gated fusion mechanism. In addition, we integrate a specialized Security Expert Agent within a Mixture-of-Experts (MoE) framework and incorporate a Cybersecurity Knowledge Graph (CKG) to supply domain-specific reasoning. Our training design further ensures that the system prompt branch remains immutable while the rest of the network learns to handle adversarial inputs safely. This PICO framework is presented via a general mathematical formulation, then elaborated in terms of the specifics of transformer architecture, and fleshed out via hypothetical case studies including Policy Puppetry attacks. While the most effective implementation may involve training transformers in a PICO-based way from scratch, we also present a cost-effective fine-tuning approach.