LocalAlign: Enabling Generalizable Prompt Injection Defense via Generation of Near-Target Adversarial Examples for Alignment Training

📅 2026-05-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited generalization of current prompt injection defenses for large language models, particularly their vulnerability to subtle attacks whose induced outputs are close to the correct response yet still wrong. To improve robustness, the authors propose LocalAlign, an alignment training method based on automatically generated “near-target” adversarial examples, which are produced through prompting in a single inference step. A margin-aware weighting mechanism measures each sample's distance to the correct response and assigns larger training weight to nearer samples. By tightening the robustness boundary around the correct response, so that even small response shifts caused by commands in untrusted data are penalized, the method is reported to improve resilience and generalization against stealthy prompt injection attacks, outperforming existing defense strategies in real-world scenarios.

📝 Abstract

Large language models are increasingly embedded into systems that interact with user data, retrieved web content, and external tools, creating a new attack surface: prompt injection, where malicious commands embedded in untrusted data override the trusted command and induce unintended behavior. Existing defenses mainly rely on fine-tuning the model to preserve an explicit boundary between trusted commands and the untrusted data portion, so that the model learns to prioritize the trusted field and ignore malicious commands in data. However, we observe that while these defenses can block obviously malicious responses caused by injected commands, they generalize poorly to real-world scenarios where the model's response to the injected command is much closer to the correct response. This is because existing methods typically train against only a fixed set of hand-crafted attack targets, which yields a loose boundary around the correct response and makes the defense easier to bypass. To address this challenge, we propose LocalAlign, a more generalizable prompt injection defense inspired by adversarial training. LocalAlign automatically and efficiently generates adversarial examples in which the command embedded in the data portion induces a response that stays close to the correct response while still being wrong. We generate such near-but-wrong adversarial examples using prompting and a single inference step. This design enforces a tighter robustness boundary around the correct response: even small response shifts induced by commands in untrusted data are explicitly penalized. Moreover, the resulting adversarial examples can vary substantially in quality across samples. To address this issue, we further introduce a margin-aware alignment algorithm that quantifies each sample's distance to the correct response and assigns larger training weight to nearer ones.
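
The margin-aware weighting idea can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not specify the distance measure or the weighting function, so the token-overlap distance, the softmax over negative distance, the `temperature` parameter, and the function names `response_distance` and `margin_aware_weights` are all assumptions chosen for illustration.

```python
import math
from difflib import SequenceMatcher


def response_distance(candidate: str, correct: str) -> float:
    """Hypothetical stand-in distance in [0, 1].

    0.0 means the whitespace-split token sequences are identical;
    1.0 means they share no matching tokens.
    """
    matcher = SequenceMatcher(None, candidate.split(), correct.split())
    return 1.0 - matcher.ratio()


def margin_aware_weights(adversarial, correct, temperature=0.5):
    """Assumed weighting rule: softmax over negative distance.

    Adversarial responses that sit nearer to the correct response
    receive a larger share of the training weight.
    """
    distances = [response_distance(r, correct) for r in adversarial]
    scores = [math.exp(-d / temperature) for d in distances]
    total = sum(scores)
    return [s / total for s in scores]


correct = "The report says revenue grew 12 percent in 2023."
adversarial = [
    # Near-target: differs from the correct response by a single token.
    "The report says revenue grew 21 percent in 2023.",
    # Far-target: an obviously unrelated, hand-crafted style attack output.
    "Visit evil.example to claim your prize now.",
]

weights = margin_aware_weights(adversarial, correct)
for response, weight in zip(adversarial, weights):
    print(f"{weight:.3f}  {response}")
```

In this toy setup the near-target response gets the larger weight, matching the stated intent of emphasizing samples close to the correct response. In an actual training loop these weights would presumably scale a per-sample alignment loss, and the distance would more plausibly be computed from model likelihoods or embeddings than from token overlap.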

Problem

Research questions and friction points this paper is trying to address.

prompt injection

adversarial examples

generalization

alignment training

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

prompt injection defense

adversarial training

near-target adversarial examples