Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Safety alignment often degrades the reasoning ability of large language models, a cost referred to as the “safety tax” and attributed in part to distributional mismatch between the training data and the model's own policy. This work proposes OPSA, the first approach to incorporate on-policy self-distillation into safety alignment: the model generates trajectories from its own policy, while a frozen teacher copy of the model, conditioned on a privileged safety context, provides token-wise KL divergence supervision. The authors also introduce a criterion, “teacher flip rate,” which measures how often a privileged context converts unsafe responses into safe ones, and use it to find contexts that elicit latent safe reasoning rather than surface-level compliance. Across two reasoning-model families and five model scales, OPSA outperforms off-policy self-distillation and external-teacher distillation baselines, with gains of up to 8.85 points on smaller models, and the gains hold across training-set sizes and adaptive jailbreak evaluations.
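To make the token-wise KL supervision concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the logits are toy nested lists rather than model outputs, and the KL direction (student relative to teacher) is an assumption, since the summary does not specify it.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def per_token_kl(student_logits, teacher_logits):
    """Return KL(student || teacher) for every position of a sampled rollout.

    Both arguments are lists of shape [seq_len][vocab_size]. In the setting
    described above, the student logits would come from the model given the
    plain prompt, and the teacher logits from a frozen copy given the same
    tokens plus a privileged safety context. The direction of the KL is an
    assumption made for this sketch.
    """
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must cover the same positions")
    kls = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        log_p = log_softmax(s_row)  # student distribution at this token
        log_q = log_softmax(t_row)  # teacher distribution at this token
        kls.append(sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q)))
    return kls


# Toy example: the two models agree at position 0 and disagree at position 1.
student = [[0.0, 0.0], [2.0, 0.0]]
teacher = [[0.0, 0.0], [0.0, 2.0]]
token_losses = per_token_kl(student, teacher)
mean_loss = sum(token_losses) / len(token_losses)
```

Because the loss is computed per position, positions where the teacher and student already agree contribute nothing, so the training signal concentrates on the tokens where the privileged context changes the teacher's prediction.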

📝 Abstract

Safety alignment often improves robustness to harmful queries at the cost of reasoning ability, a tradeoff known as the safety tax. A common cause is distributional mismatch: supervised fine-tuning trains the target model on safety demonstrations produced by humans, external models, or fixed self-generated traces, rather than on trajectories sampled from its own policy. We identify off-policy training mismatch as a second source of this tax and study on-policy self-distillation for safety alignment, which we call OPSA. The model generates its own rollouts and receives dense per-token KL supervision from a frozen teacher copy of itself conditioned on a privileged safety context. Because this teacher must be safer than the sampled student trajectory, we introduce teacher flip rate: a criterion that measures how often a privileged context converts unsafe responses into safe ones. We use this signal to search for contexts that activate latent safety reasoning rather than merely elicit safe-looking demonstrations. Across two reasoning-model families and five model scales, OPSA achieves a stronger safety-reasoning tradeoff than off-policy self-distillation and external-teacher distillation under matched data and full-parameter fine-tuning, with the largest gains on smaller models (+8.85 points on R1-Distill-1.5B and +5.49 points on Qwen3-0.6B). The gains persist across training-set sizes and adaptive jailbreak evaluations. Token-level analyses further show that OPSA concentrates updates near early compliance-decision tokens, providing a mechanism for improving safety while preserving general reasoning.
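The teacher flip rate can be illustrated with a short sketch. This is one plausible reading of the description above, not the paper's exact definition: it assumes the rate is the fraction of prompts with an unsafe response without the privileged context that become safe once the context is added. The function name and the boolean safety labels (which would come from some safety judge) are hypothetical.

```python
def teacher_flip_rate(unsafe_without_context, unsafe_with_context):
    """Fraction of initially unsafe responses that the privileged context makes safe.

    Each argument is a list of booleans, one per prompt, where True means the
    response was judged unsafe. The denominator (prompts that were unsafe
    without the context) is an assumption of this sketch.
    """
    if len(unsafe_without_context) != len(unsafe_with_context):
        raise ValueError("both label lists must cover the same prompts")
    pairs = list(zip(unsafe_without_context, unsafe_with_context))
    baseline_unsafe = sum(1 for before, _ in pairs if before)
    if baseline_unsafe == 0:
        return 0.0  # nothing to flip
    flips = sum(1 for before, after in pairs if before and not after)
    return flips / baseline_unsafe


# Toy labels for four prompts: three are unsafe without the context,
# and two of those three become safe with it.
before = [True, True, False, True]
after = [False, True, False, False]
rate = teacher_flip_rate(before, after)  # 2 flips out of 3 unsafe baselines
```

Under this reading, candidate contexts could be ranked by their flip rate on a held-out set of harmful prompts, with higher rates indicating contexts that more reliably produce a teacher safer than the student.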

Problem

Research questions and friction points this paper is trying to address.

safety tax

distributional mismatch

safety alignment

reasoning ability

off-policy training

Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy self-distillation

safety alignment

teacher flip rate