🤖 AI Summary
This work identifies the “incomplete learning” problem in safety alignment of large language models (LLMs): position-dependent gradient decay during autoregressive training leaves safety preferences insufficiently covered in the latter half of model responses, inducing systematic vulnerabilities. We introduce *base-favored tokens* as a quantitative metric and empirically demonstrate, for the first time, that safety signal strength decays with token position. To address this, we propose an adaptive, position-aware penalty mechanism and a hybrid teacher distillation framework that precisely reinforce under-aligned response segments. Evaluated on the Llama and Qwen model families, our approach reduces adversarial attack success rates by 48%–98% while preserving general capabilities (e.g., MMLU, BBH). This is the first study to incorporate gradient dynamics modeling into safety alignment analysis, establishing a novel paradigm for trustworthy generative AI.
📝 Abstract
Large language models exhibit systematic vulnerabilities to adversarial attacks despite extensive safety alignment. We provide a mechanistic analysis revealing that position-dependent gradient weakening during autoregressive training creates signal decay, leading to incomplete safety learning in which safety training fails to fully transform model preferences in later response regions. We introduce base-favored tokens -- vocabulary elements to which base models assign higher probability than aligned models do -- as computational indicators of incomplete safety learning, and develop a targeted completion method that addresses undertrained regions through adaptive penalties and hybrid teacher distillation. Experimental evaluation across the Llama and Qwen model families demonstrates dramatic improvements in adversarial robustness, with 48--98% reductions in attack success rates while preserving general capabilities. These results establish both a mechanistic understanding of, and practical solutions for, fundamental limitations in safety alignment methodologies.
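The base-favored-token indicator can be sketched in a few lines: compare the base and aligned models' next-token distributions at each response position and flag tokens where the base model assigns higher probability. The toy probability arrays and the strict-inequality comparison below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

# Toy next-token probability distributions for 3 response positions
# over a 4-token vocabulary (hypothetical numbers, for illustration only).
p_base = np.array([
    [0.10, 0.20, 0.30, 0.40],
    [0.25, 0.25, 0.25, 0.25],
    [0.40, 0.30, 0.20, 0.10],
])
p_aligned = np.array([
    [0.05, 0.15, 0.35, 0.45],  # alignment strongly reshapes position 0
    [0.24, 0.26, 0.25, 0.25],  # weaker reshaping at position 1
    [0.40, 0.30, 0.20, 0.10],  # no reshaping at position 2
])

# A token is "base-favored" at a position when the base model assigns it
# strictly higher probability than the aligned model does.
base_favored = p_base > p_aligned

# Per-position counts and rates of base-favored tokens.
counts = base_favored.sum(axis=1)
rate_per_position = base_favored.mean(axis=1)
print(counts.tolist())             # [2, 1, 0]
print(rate_per_position.tolist())  # [0.5, 0.25, 0.0]
```

Tracking this rate as a function of token position is what lets the analysis localize where safety training left the aligned model's preferences closest to, or furthest from, the base model's.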