🤖 AI Summary
Despite safety alignment, large language models (LLMs) remain vulnerable to jailbreak attacks, and existing alignment is often shallow: whether an output is harmful is largely determined within the first few generated tokens. Building on this observation, the paper proposes D-STT, which identifies the safety trigger tokens of a given safety-aligned LLM (the highly similar initial tokens of its refusal responses) and explicitly decodes a single such token early in generation to activate the model's learned safety behavior. The method relies on token-level behavioral analysis and constrained decoding, requires no fine-tuning or additional training, and is compatible with diverse safety-aligned LLMs. Experiments across multiple jailbreak attack scenarios and benign prompts show that it significantly reduces harmful output rates while preserving original task performance and adding negligible latency, outperforming ten baseline methods.
📝 Abstract
Large Language Models (LLMs) have been extensively used across diverse domains, including virtual assistants, automated code generation, and scientific research. However, they remain vulnerable to jailbreak attacks, which manipulate the models into generating harmful responses despite safety alignment. Recent studies have shown that current safety-aligned LLMs often exhibit shallow safety alignment, where the first few tokens largely determine whether the response will be harmful. Through comprehensive observations, we find that safety-aligned LLMs and various defense strategies generate highly similar initial tokens in their refusal responses, which we define as safety trigger tokens. Building on this insight, we propose D-STT, a simple yet effective defense algorithm that identifies and explicitly decodes safety trigger tokens of the given safety-aligned LLM to trigger the model's learned safety patterns. In this process, the safety trigger is constrained to a single token, which effectively preserves model usability by introducing minimal intervention in the decoding process. Extensive experiments across diverse jailbreak attacks and benign prompts demonstrate that D-STT significantly reduces output harmfulness while preserving model usability and incurring negligible response time overhead, outperforming ten baseline methods.
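The decoding-time intervention described above can be illustrated with a minimal sketch. Assuming a Hugging Face causal LM, the hypothetical helper below forces a single candidate safety trigger token (here the token "I", a common opener of refusal responses) as the first decoded token whenever the model already ranks it among its top next-token candidates. This is only an illustration of the constrained-decoding idea: the actual D-STT procedure identifies the trigger token from the model's own refusal behavior, and its exact selection rule may differ.

```python
# Minimal sketch of single-trigger-token constrained decoding (not the paper's
# exact D-STT implementation). Model name, trigger token, and top-k selection
# rule are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"   # assumed safety-aligned model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Hypothetical safety trigger token: many aligned chat models open refusals
# with "I" ("I cannot ..."); the paper derives the trigger from the model's
# own refusal responses rather than hard-coding it as done here.
TRIGGER_ID = tokenizer.encode("I", add_special_tokens=False)[0]

@torch.no_grad()
def generate_with_trigger(prompt: str, max_new_tokens: int = 128, top_k: int = 10) -> str:
    """Explicitly decode the safety trigger token as the first output token
    when the model already ranks it among its top candidates, then continue
    decoding unconstrained (greedy decoding for simplicity)."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    prompt_len = input_ids.shape[-1]

    # Inspect the model's next-token distribution for the prompt.
    next_logits = model(input_ids).logits[0, -1]
    top_ids = torch.topk(next_logits, top_k).indices.tolist()

    # Assumed selection rule: force the trigger only if the model already
    # considers it a plausible first token, limiting intervention on benign prompts.
    if TRIGGER_ID in top_ids:
        input_ids = torch.cat([input_ids, torch.tensor([[TRIGGER_ID]])], dim=-1)

    out = model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(out[0, prompt_len:], skip_special_tokens=True)
```

Because the intervention touches at most one token and only the first decoding step, this kind of sketch adds essentially no latency beyond a single extra forward pass over the prompt.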