Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth

📅 2025-10-20
🤖 AI Summary
Large language models (LLMs) suffer from "shallow alignment": they reliably reject harmful queries only at the initial generation step, but their safety collapses rapidly once adversarial content is sustained, e.g., under adversarial prompt or prefill attacks. This work proposes Any-Depth Alignment (ADA), a fine-tuning-free, parameter-invariant inference-time defense. ADA's core insight is the identification and exploitation of alignment priors concentrated in the assistant header tokens; it dynamically re-injects these tokens in a streaming fashion to trigger on-the-fly harm reassessment at arbitrary generation depths. Evaluated against adversarial prefill sequences spanning dozens to thousands of tokens, ADA achieves near-perfect (≈100%) refusal rates and reduces the success rates of mainstream jailbreak attacks to below 3%. Crucially, it incurs negligible degradation on standard benign tasks while offering low computational overhead, strong generalization across models and attack types, and seamless plug-and-play deployment.

📝 Abstract
Large Language Models (LLMs) exhibit strong but shallow alignment: they directly refuse harmful queries when a refusal is expected at the very start of an assistant turn, yet this protection collapses once a harmful continuation is underway (either through adversarial prompt attacks or via harmful assistant-prefill attacks). This raises a fundamental question: Can the innate shallow alignment in LLMs be unlocked to ensure safety at arbitrary generation depths? To achieve this goal, we propose Any-Depth Alignment (ADA), an effective inference-time defense with negligible overhead. ADA is built on our observation that alignment is concentrated in the assistant header tokens through their repeated use in shallow-refusal training, so these tokens carry the model's strong alignment priors. By reintroducing these tokens mid-stream, ADA induces the model to reassess harmfulness and recover refusals at any point in generation. Across diverse open-source model families (Llama, Gemma, Mistral, Qwen, DeepSeek, and gpt-oss), ADA achieves robust safety performance without requiring any changes to the base model's parameters. It secures a near-100% refusal rate against challenging adversarial prefill attacks ranging from dozens to thousands of tokens. Furthermore, ADA reduces the average success rate of prominent adversarial prompt attacks (such as GCG, AutoDAN, PAIR, and TAP) to below 3%. This is all accomplished while preserving utility on benign tasks with minimal over-refusal. ADA maintains this resilience even after the base model undergoes subsequent instruction tuning (benign or adversarial).
Problem

Research questions and friction points this paper is trying to address.

LLMs exhibit shallow safety alignment that collapses during generation
Harmful content protection fails once harmful continuation begins
Need to unlock innate alignment for safety at arbitrary generation depths
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reintroduces alignment tokens mid-generation for reassessment
Enables safety at any generation depth without retraining
Secures near-perfect refusal rates against adversarial attacks
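The mechanism above, re-injecting the assistant header tokens mid-generation so the model reassesses harmfulness, can be sketched as follows. This is an illustrative sketch, not the paper's implementation: `generate_with_ada`, `probe_with_header`, the `header` string, the chunked generation loop, and the refusal-marker heuristic are all assumptions for demonstration.

```python
# Hedged sketch of an ADA-style inference-time defense (illustrative only).
# The `model` argument is any callable text -> continuation; the paper's
# actual probing and scoring details may differ.

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "sorry")


def probe_with_header(model, prompt, partial_response, header="<|assistant|>"):
    """Re-inject the assistant header after the partial response and let the
    model continue; a refusal-like continuation signals harmful content."""
    probe_input = prompt + partial_response + "\n" + header
    continuation = model(probe_input)
    return any(m in continuation.lower() for m in REFUSAL_MARKERS)


def generate_with_ada(model, prompt, max_steps=8, chunk_size=32,
                      header="<|assistant|>"):
    """Generate in chunks; after each chunk, run the header probe and
    abort with a refusal if the model reassesses the content as harmful."""
    response = ""
    for _ in range(max_steps):
        chunk = model(prompt + response, n_tokens=chunk_size)
        if not chunk:
            break  # model finished generating
        response += chunk
        if probe_with_header(model, prompt, response, header):
            # Safety recovered at this depth: stop and refuse.
            return response + "\n[ADA] I can't continue with this request."
    return response
```

Because the probe is only a short extra forward pass every `chunk_size` tokens, the overhead stays low regardless of how deep into the generation the harmful content appears.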
Jiawei Zhang
ByteDance Seed, University of Chicago
Andrew Estornell
ByteDance Research
Large Language Models · Multi-Agent Systems · Algorithmic Fairness
David D. Baek
Massachusetts Institute of Technology
Bo Li
University of Chicago, University of Illinois Urbana-Champaign
Xiaojun Xu
ByteDance Seed