🤖 AI Summary
This work identifies the “incomplete learning” problem in safety alignment of large language models (LLMs): position-dependent gradient decay during autoregressive training leaves safety preferences insufficiently covered in the latter half of model responses, inducing systematic vulnerabilities. We introduce *base-favored tokens* as a quantitative metric and empirically demonstrate, for the first time, that safety signal strength decays with token position. To address this, we propose an adaptive, position-aware penalty mechanism and a hybrid teacher distillation framework that precisely reinforce under-aligned response segments. Evaluated on the Llama and Qwen model families, our approach reduces adversarial attack success rates by 48%–98% while preserving general capabilities (e.g., MMLU, BBH). This is the first study to incorporate gradient dynamics modeling into safety alignment analysis, establishing a novel paradigm for trustworthy generative AI.
📝 Abstract
Large language models exhibit systematic vulnerabilities to adversarial attacks despite extensive safety alignment. We provide a mechanistic analysis revealing that position-dependent gradient weakening during autoregressive training creates signal decay, leading to incomplete safety learning in which safety training fails to fully transform model preferences in later response regions. We introduce base-favored tokens -- vocabulary elements to which base models assign higher probability than aligned models do -- as computational indicators of incomplete safety learning, and develop a targeted completion method that addresses undertrained regions through adaptive penalties and hybrid teacher distillation. Experimental evaluation across the Llama and Qwen model families demonstrates dramatic improvements in adversarial robustness, with 48--98% reductions in attack success rates while preserving general capabilities. These results establish both a mechanistic understanding of, and practical solutions for, fundamental limitations in safety alignment methodologies.
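The base-favored-token indicator can be sketched in a few lines: compare the base and aligned models' next-token distributions at each response position and flag tokens where the base model assigns higher probability. The toy probability arrays and the strict-inequality comparison below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

# Toy next-token probability distributions for 3 response positions
# over a 4-token vocabulary (hypothetical numbers, for illustration only).
p_base = np.array([
    [0.10, 0.20, 0.30, 0.40],
    [0.25, 0.25, 0.25, 0.25],
    [0.40, 0.30, 0.20, 0.10],
])
p_aligned = np.array([
    [0.05, 0.15, 0.35, 0.45],  # alignment strongly reshapes position 0
    [0.24, 0.26, 0.25, 0.25],  # weaker reshaping at position 1
    [0.40, 0.30, 0.20, 0.10],  # no reshaping at position 2
])

# A token is "base-favored" at a position when the base model assigns it
# strictly higher probability than the aligned model does.
base_favored = p_base > p_aligned

# Per-position counts and rates of base-favored tokens.
counts = base_favored.sum(axis=1)
rate_per_position = base_favored.mean(axis=1)
print(counts.tolist())             # [2, 1, 0]
print(rate_per_position.tolist())  # [0.5, 0.25, 0.0]
```

Tracking this rate as a function of token position is what lets the analysis localize where safety training left the aligned model's preferences closest to, or furthest from, the base model's.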