Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection

📅 2026-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the “alignment tax” commonly observed when large language models undergo safety alignment: safety fine-tuning degrades general capabilities such as reasoning and coding. Framing safety alignment as a continual learning problem, the authors propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight, plug-and-play method that requires no large-scale replay, auxiliary objectives, or retraining. OGPSA estimates a low-rank capability subspace from gradients on a small reference set and projects each safety gradient onto its orthogonal complement, decoupling safety updates from the directions that support general abilities within standard post-training pipelines such as supervised fine-tuning (SFT) and direct preference optimization (DPO). Experiments show that OGPSA advances the safety–utility Pareto frontier on models such as Qwen2.5-7B-Instruct, improving SimpleQA accuracy from 0.53% to 3.03% and IFEval performance from 51.94% to 63.96%.

📝 Abstract
Large Language Models (LLMs) often incur an alignment tax: safety post-training can reduce general utility (e.g., reasoning and coding). We argue that this tax primarily arises from continual-learning-style forgetting in sequential alignment, where distribution shift and conflicting objectives cause safety updates to overwrite pre-trained competencies. Accordingly, we cast safety alignment as a continual learning (CL) problem that must balance plasticity (acquiring safety constraints) and stability (preserving general abilities). We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight method that mitigates interference by constraining each safety update to be orthogonal (in a first-order sense) to a learned subspace capturing general capabilities. Specifically, OGPSA estimates a low-rank capability subspace from gradients on a small reference set and projects the safety gradient onto its orthogonal complement before updating. This produces safety-directed updates that minimally perturb prior knowledge while retaining capacity for alignment. OGPSA is plug-and-play and integrates into standard post-training pipelines without large-scale replay, auxiliary objectives, or retraining. Across Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT→DPO settings, OGPSA consistently improves the safety–utility Pareto frontier over standard baselines. For instance, on Qwen2.5-7B-Instruct under SFT→DPO, OGPSA preserves strong safety while recovering general capability, improving SimpleQA from 0.53% to 3.03% and IFEval from 51.94% to 63.96%. Our source code is available at https://github.com/SunGL001/OGPSA.
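
The projection step described in the abstract admits a compact sketch. The following is a minimal, self-contained illustration of the core idea, not the authors' implementation: the subspace rank, the use of a single flattened gradient vector (rather than per-layer subspaces), and the plain SVD-based subspace estimate are assumptions made here for exposition.

```python
import torch

def estimate_capability_subspace(ref_grads: torch.Tensor, rank: int) -> torch.Tensor:
    """Return an orthonormal basis U of shape (d, rank) spanning the top-`rank`
    directions of the reference-set gradients (rows of ref_grads, shape (n, d))."""
    # The leading right singular vectors capture the directions along which
    # the general-capability loss is most sensitive.
    _, _, Vh = torch.linalg.svd(ref_grads, full_matrices=False)
    return Vh[:rank].T

def project_orthogonal(safety_grad: torch.Tensor, U: torch.Tensor) -> torch.Tensor:
    """Project the safety gradient onto the orthogonal complement of the
    capability subspace: g_perp = g - U (U^T g)."""
    return safety_grad - U @ (U.T @ safety_grad)

# Toy check: the projected update has no component inside the subspace.
d, n, rank = 512, 32, 8
ref_grads = torch.randn(n, d)            # stands in for reference-set gradients
U = estimate_capability_subspace(ref_grads, rank)
g = torch.randn(d)                       # stands in for a safety-batch gradient
g_perp = project_orthogonal(g, U)
print(torch.linalg.norm(U.T @ g_perp))   # ~0: orthogonal to the capability subspace
```

In an actual post-training loop, the projected gradient would be written back to the model parameters before the optimizer step, so that each safety update leaves the general-capability loss unchanged to first order.
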
Problem

Research questions and friction points this paper is trying to address.

alignment tax
continual learning
safety alignment
catastrophic forgetting
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Orthogonal Gradient Projection
Safety Alignment
Continual Learning
Alignment Tax
Pareto Frontier