🤖 AI Summary
Large language models (LLMs) are vulnerable to prompt injection attacks because instructions and input data are not explicitly separated at the architectural level. To address this, the authors propose ASIDE, an architecture-level change that decouples instruction and data representations without requiring pretraining from scratch. Rather than learning new embeddings, ASIDE reuses the original embedding layer twice, applying a fixed orthogonal rotation to one copy, so that instruction tokens and data tokens flow through two parallel embedding paths. Experiments show substantially increased instruction-data separation scores without loss of model capabilities, and competitive results on standard prompt injection benchmarks even without dedicated safety training. A representation-space analysis sheds light on the mechanism behind these gains.
📝 Abstract
Despite their remarkable performance, large language models lack elementary safety features, which makes them susceptible to numerous malicious attacks. In particular, previous work has identified the absence of an intrinsic separation between instructions and data as a root cause of the success of prompt injection attacks. In this work, we propose an architectural change, ASIDE, that allows the model to clearly separate instructions from data by using separate embeddings for them. Instead of training the embeddings from scratch, we propose a method to convert an existing model to ASIDE form by using two copies of the original model's embedding layer and applying an orthogonal rotation to one of them. We demonstrate the effectiveness of our method by showing (1) highly increased instruction-data separation scores without a loss in model capabilities and (2) competitive results on prompt injection benchmarks, even without dedicated safety training. Additionally, we study the working mechanism behind our method through an analysis of model representations.
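The core mechanism described above can be sketched in a few lines: keep one pretrained embedding table for instruction tokens, rotate a second copy of the same table by a fixed orthogonal matrix for data tokens, and route each token through the path matching its role. This is a minimal illustration under assumptions, not the paper's implementation; the table sizes, the random orthogonal rotation, and the `embed` helper are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 10, 4

# Pretend pretrained embedding table (instruction path).
E_inst = rng.standard_normal((vocab, dim))

# Fixed orthogonal rotation: QR of a random matrix yields an orthogonal Q.
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))

# Data-path table: the same embeddings, rotated (no new parameters learned).
E_data = E_inst @ Q

def embed(token_ids, is_data):
    """Look up each token in the instruction or data table by its role."""
    return np.where(is_data[:, None], E_data[token_ids], E_inst[token_ids])

tokens = np.array([1, 2, 3])
roles = np.array([False, False, True])  # last token comes from untrusted data
out = embed(tokens, roles)

# Because the rotation is orthogonal, token embedding norms are unchanged,
# which is one reason the conversion can preserve model capabilities.
assert np.allclose(np.linalg.norm(E_data, axis=1),
                   np.linalg.norm(E_inst, axis=1))
```

The same token id thus maps to two distinct but geometrically related vectors depending on whether it arrives as an instruction or as data, giving downstream layers an explicit signal of token provenance.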