Gradient-Controlled Decoding: A Safety Guardrail for LLMs with Dual-Anchor Steering

πŸ“… 2026-04-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Large language models are vulnerable to jailbreaking and prompt injection attacks, yet existing defenses often compromise user experience through high false rejection rates and lack safety guarantees during generation. This work proposes a training-free defense mechanism that tightens the detection decision boundary by introducing dual "accept" and "reject" anchors, and pre-injects refusal tokens during mitigation to guarantee the safety of the first generated token. Requiring only 20 demonstration templates, the method is compatible with mainstream architectures and arbitrary sampling strategies. Experiments show that, compared to GradSafe, it reduces false rejection rates by 52%, lowers attack success rates by up to 10%, and incurs an average latency increase of merely 15–20 milliseconds. The approach also transfers successfully to LLaMA-2-7B, Mixtral-8x7B, and Qwen-2-7B.
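The dual-anchor idea in the summary can be sketched as a simple comparison: score a prompt by how closely its safety-relevant gradient aligns with the gradient of a "reject" anchor versus an "accept" anchor, and flag it when the reject side wins. A minimal illustrative sketch follows; the vectors, the `margin` parameter, and the function names are assumptions for illustration, not the paper's actual implementation or values.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def dual_anchor_flag(prompt_grad, accept_grad, reject_grad, margin=0.0):
    """Flag the prompt as unsafe when its gradient is closer to the
    reject anchor than to the accept anchor by at least `margin`.
    The second anchor is what tightens the decision boundary relative
    to a single "accept all" anchor."""
    gap = cosine(prompt_grad, reject_grad) - cosine(prompt_grad, accept_grad)
    return gap > margin

# Illustrative 3-d "gradients" standing in for real model gradients.
accept = [1.0, 0.0, 0.0]   # direction associated with the "Sure" anchor
reject = [0.0, 1.0, 0.0]   # direction associated with the "Sorry" anchor
print(dual_anchor_flag([0.1, 0.9, 0.0], accept, reject))  # → True
print(dual_anchor_flag([0.9, 0.1, 0.0], accept, reject))  # → False
```

Intuitively, a lone accept anchor forces one brittle threshold; the reject anchor gives the score a second reference point, so benign prompts that merely sit far from "Sure" are no longer flagged.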
πŸ“ Abstract
Large language models (LLMs) remain susceptible to jailbreak and direct prompt-injection attacks, yet the strongest defensive filters frequently over-refuse benign queries and degrade user experience. Previous work on jailbreak and prompt-injection detection, such as GradSafe, detects unsafe prompts with a single "accept all" anchor token, but its threshold is brittle and it offers no deterministic guarantee that harmful content will not be emitted once decoding begins. We introduce Gradient-Controlled Decoding (GCD), a training-free guardrail that combines an acceptance anchor token ("Sure") with a refusal anchor token ("Sorry"), tightening the decision boundary and significantly lowering false positives. In the mitigation stage, if a prompt is flagged, GCD pre-injects one or two refusal tokens ("Sorry, I can't...") before autoregressive decoding resumes, guaranteeing first-token safety regardless of sampling strategy. On ToxicChat, XSTest-v2, and AdvBench, GCD reduces false positives by 52% vs. GradSafe at comparable recall, lowers attack success rate by up to 10% vs. the strongest decoding-only baseline, adds under 15–20 ms latency on average on V100 instances, transfers to LLaMA-2-7B, Mixtral-8x7B, and Qwen-2-7B, and requires only 20 demonstration templates.
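The mitigation stage described above can be sketched as seeding the output with refusal tokens before normal sampling resumes, so the first emitted token is safe no matter which sampling strategy follows. The sketch below is a minimal illustration under stated assumptions: `REFUSAL_PREFIX`, `guarded_decode`, and the toy sampler are hypothetical names, and a real implementation would operate on token ids inside the model's decoding loop.

```python
from typing import Callable, List

# Hypothetical pre-injected refusal tokens, mirroring the paper's
# "Sorry, I can't..." example (illustrative tokenization).
REFUSAL_PREFIX = ["Sorry", ",", " I", " can't"]

def guarded_decode(is_flagged: bool,
                   sample_next: Callable[[List[str]], str],
                   max_new_tokens: int = 8) -> List[str]:
    """Decode with first-token safety: if the detector flagged the
    prompt, seed the output with refusal tokens, then resume ordinary
    autoregressive sampling from that safe prefix."""
    out: List[str] = list(REFUSAL_PREFIX) if is_flagged else []
    while len(out) < max_new_tokens:
        out.append(sample_next(out))  # any strategy: greedy, top-p, ...
    return out

# Toy sampler standing in for an arbitrary sampling strategy.
def toy_sampler(context: List[str]) -> str:
    return " <tok>"

print(guarded_decode(True, toy_sampler)[:4])   # starts with the refusal prefix
print(guarded_decode(False, toy_sampler)[:2])  # unflagged prompts decode normally
```

Because the guarantee comes from prefixing rather than from constraining the sampler, it holds for any decoding strategy, which matches the abstract's "regardless of sampling strategy" claim.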
Problem

Research questions and friction points this paper is trying to address.

jailbreak attacks
prompt injection
false positives
safety guardrail
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradient-Controlled Decoding
dual-anchor steering
training-free guardrail
prompt injection defense
first-token safety
πŸ”Ž Similar Papers
No similar papers found.
Purva Chiniya
Amazon
Kevin Scaria
Amazon Alexa
Sagar Chaturvedi
Amazon AGI