AI Summary
Large language models are vulnerable to jailbreaking and prompt injection attacks, yet existing defenses often compromise user experience through high false rejection rates and lack safety guarantees during generation. This work proposes a training-free defense mechanism that tightens the decision boundary by introducing dual "accept" and "reject" anchors during detection and pre-injects rejection tokens during mitigation to ensure the safety of the first generated token. Requiring only 20 example templates, the method is compatible with mainstream architectures and arbitrary sampling strategies. Experiments show that, compared to GradSafe, it reduces false rejection rates by 52%, lowers attack success rates by up to 10%, and incurs an average latency increase of merely 15–20 milliseconds. The approach also successfully transfers to LLaMA-2-7B, Mixtral-8x7B, and Qwen-2-7B.
Abstract
Large language models (LLMs) remain susceptible to jailbreak and direct prompt-injection attacks, yet the strongest defensive filters frequently over-refuse benign queries and degrade user experience. Prior work on jailbreak and prompt-injection detection, such as GradSafe, detects unsafe prompts with a single "accept all" anchor token, but its threshold is brittle and it offers no deterministic guarantee that harmful content will not be emitted once decoding begins. We introduce Gradient-Controlled Decoding (GCD), a training-free guardrail that combines an acceptance anchor token ("Sure") with a refusal anchor token ("Sorry"), tightening the decision boundary and significantly lowering false positives. In the mitigation stage, if a prompt is flagged, GCD pre-injects one or two refusal tokens ("Sorry, I can't...") before autoregressive decoding resumes, guaranteeing first-token safety regardless of sampling strategy. On ToxicChat, XSTest-v2, and AdvBench, GCD reduces false positives by 52% versus GradSafe at comparable recall, lowers attack success rate by up to 10% versus the strongest decoding-only baseline, adds only 15–20 ms of average latency on V100 instances, transfers to LLaMA-2-7B, Mixtral-8x7B, and Qwen-2-7B, and requires only 20 demonstration templates.
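The two stages described above (dual-anchor detection, then refusal-prefix injection) can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the helper names (`is_unsafe`, `generate_with_gcd`) and the stand-in decoder are hypothetical, and real anchor scores would come from gradient similarity against the "Sure" and "Sorry" anchors inside the model.

```python
# Minimal sketch of GCD's control flow, assuming anchor-similarity scores
# (accept_sim for "Sure", reject_sim for "Sorry") are computed elsewhere.

REFUSAL_PREFIX = ["Sorry,", "I", "can't"]  # refusal tokens to pre-inject

def is_unsafe(accept_sim: float, reject_sim: float) -> bool:
    """Detection: flag the prompt when it sits closer to the refusal
    anchor than to the acceptance anchor. Using both anchors tightens
    the decision boundary relative to a single-anchor threshold."""
    return reject_sim > accept_sim

def generate_with_gcd(prompt_tokens, decode_step, accept_sim, reject_sim,
                      max_new_tokens=5, n_inject=2):
    """Mitigation: if flagged, pre-inject one or two refusal tokens so the
    first emitted token is safe regardless of the sampling strategy, then
    resume ordinary autoregressive decoding."""
    out = []
    if is_unsafe(accept_sim, reject_sim):
        out.extend(REFUSAL_PREFIX[:n_inject])
    while len(out) < max_new_tokens:
        out.append(decode_step(prompt_tokens + out))  # normal decoding resumes
    return out

# Toy decoder standing in for sampled LLM decoding.
toy_decode = lambda ctx: f"tok{len(ctx)}"

benign = generate_with_gcd(["hello"], toy_decode, accept_sim=0.9, reject_sim=0.1)
flagged = generate_with_gcd(["attack"], toy_decode, accept_sim=0.2, reject_sim=0.8)
```

Because the injected prefix is placed in the output buffer before the first sampling step, first-token safety holds for greedy, top-k, or nucleus sampling alike, which is the property the abstract refers to.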