Gradient-Controlled Decoding: A Safety Guardrail for LLMs with Dual-Anchor Steering

πŸ“… 2026-04-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Large language models are vulnerable to jailbreaking and prompt injection attacks, yet existing defenses often compromise user experience through high false rejection rates and lack safety guarantees during generation. This work proposes a training-free defense mechanism that tightens the detection decision boundary by introducing dual "accept" and "reject" anchors, and pre-injects refusal tokens during mitigation to guarantee the safety of the first generated token. Requiring only 20 demonstration templates, the method is compatible with mainstream architectures and arbitrary sampling strategies. Experiments show that, compared to GradSafe, it reduces false rejection rates by 52%, lowers attack success rates by up to 10%, and incurs an average latency increase of merely 15–20 milliseconds. The approach also transfers successfully to LLaMA-2-7B, Mixtral-8x7B, and Qwen-2-7B.
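The dual-anchor idea in the summary can be sketched as a simple comparison: score a prompt by how closely its safety-relevant gradient aligns with the gradient of a "reject" anchor versus an "accept" anchor, and flag it when the reject side wins. A minimal illustrative sketch follows; the vectors, the `margin` parameter, and the function names are assumptions for illustration, not the paper's actual implementation or values.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def dual_anchor_flag(prompt_grad, accept_grad, reject_grad, margin=0.0):
    """Flag the prompt as unsafe when its gradient is closer to the
    reject anchor than to the accept anchor by at least `margin`.
    The second anchor is what tightens the decision boundary relative
    to a single "accept all" anchor."""
    gap = cosine(prompt_grad, reject_grad) - cosine(prompt_grad, accept_grad)
    return gap > margin

# Illustrative 3-d "gradients" standing in for real model gradients.
accept = [1.0, 0.0, 0.0]   # direction associated with the "Sure" anchor
reject = [0.0, 1.0, 0.0]   # direction associated with the "Sorry" anchor
print(dual_anchor_flag([0.1, 0.9, 0.0], accept, reject))  # → True
print(dual_anchor_flag([0.9, 0.1, 0.0], accept, reject))  # → False
```

Intuitively, a lone accept anchor forces one brittle threshold; the reject anchor gives the score a second reference point, so benign prompts that merely sit far from "Sure" are no longer flagged.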
πŸ“ Abstract
Large language models (LLMs) remain susceptible to jailbreak and direct prompt-injection attacks, yet the strongest defensive filters frequently over-refuse benign queries and degrade user experience. Previous work on jailbreak and prompt-injection detection, such as GradSafe, detects unsafe prompts with a single "accept all" anchor token, but its threshold is brittle and it offers no deterministic guarantee that harmful content will not be emitted once decoding begins. We introduce Gradient-Controlled Decoding (GCD), a training-free guardrail that combines an acceptance anchor token ("Sure") with a refusal anchor token ("Sorry"), tightening the decision boundary and significantly lowering false positives. In the mitigation stage, if a prompt is flagged, GCD pre-injects one or two refusal tokens ("Sorry, I can't...") before autoregressive decoding resumes, guaranteeing first-token safety regardless of sampling strategy. On ToxicChat, XSTest-v2, and AdvBench, GCD reduces false positives by 52% vs. GradSafe at comparable recall, lowers attack success rate by up to 10% vs. the strongest decoding-only baseline, adds under 15–20 ms latency on average on V100 instances, transfers to LLaMA-2-7B, Mixtral-8x7B, and Qwen-2-7B, and requires only 20 demonstration templates.
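The mitigation stage described above can be sketched as seeding the output with refusal tokens before normal sampling resumes, so the first emitted token is safe no matter which sampling strategy follows. The sketch below is a minimal illustration under stated assumptions: `REFUSAL_PREFIX`, `guarded_decode`, and the toy sampler are hypothetical names, and a real implementation would operate on token ids inside the model's decoding loop.

```python
from typing import Callable, List

# Hypothetical pre-injected refusal tokens, mirroring the paper's
# "Sorry, I can't..." example (illustrative tokenization).
REFUSAL_PREFIX = ["Sorry", ",", " I", " can't"]

def guarded_decode(is_flagged: bool,
                   sample_next: Callable[[List[str]], str],
                   max_new_tokens: int = 8) -> List[str]:
    """Decode with first-token safety: if the detector flagged the
    prompt, seed the output with refusal tokens, then resume ordinary
    autoregressive sampling from that safe prefix."""
    out: List[str] = list(REFUSAL_PREFIX) if is_flagged else []
    while len(out) < max_new_tokens:
        out.append(sample_next(out))  # any strategy: greedy, top-p, ...
    return out

# Toy sampler standing in for an arbitrary sampling strategy.
def toy_sampler(context: List[str]) -> str:
    return " <tok>"

print(guarded_decode(True, toy_sampler)[:4])   # starts with the refusal prefix
print(guarded_decode(False, toy_sampler)[:2])  # unflagged prompts decode normally
```

Because the guarantee comes from prefixing rather than from constraining the sampler, it holds for any decoding strategy, which matches the abstract's "regardless of sampling strategy" claim.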
Problem

Research questions and friction points this paper is trying to address.

jailbreak attacks
prompt injection
false positives
safety guardrail
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradient-Controlled Decoding
dual-anchor steering
training-free guardrail
prompt injection defense
first-token safety
πŸ”Ž Similar Papers
No similar papers found.
Purva Chiniya
Amazon
Kevin Scaria
Amazon Alexa
Sagar Chaturvedi
Amazon AGI