Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection

📅 2026-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the “alignment tax” commonly observed when large language models undergo safety alignment: safety fine-tuning degrades general capabilities such as reasoning and coding. Framing safety alignment as a continual learning problem, the authors propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight, plug-and-play method that requires no large-scale replay, auxiliary objectives, or retraining. OGPSA estimates a low-rank capability subspace from gradients on a small reference set and projects each safety gradient onto its orthogonal complement, decoupling safety updates from the directions that support general abilities within standard post-training pipelines such as supervised fine-tuning (SFT) and direct preference optimization (DPO). Experiments show that OGPSA advances the safety–utility Pareto frontier on models such as Qwen2.5-7B-Instruct, improving SimpleQA accuracy from 0.53% to 3.03% and IFEval performance from 51.94% to 63.96%.

📝 Abstract
Large Language Models (LLMs) often incur an alignment tax: safety post-training can reduce general utility (e.g., reasoning and coding). We argue that this tax primarily arises from continual-learning-style forgetting in sequential alignment, where distribution shift and conflicting objectives cause safety updates to overwrite pre-trained competencies. Accordingly, we cast safety alignment as a continual learning (CL) problem that must balance plasticity (acquiring safety constraints) and stability (preserving general abilities). We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight method that mitigates interference by constraining each safety update to be orthogonal (in a first-order sense) to a learned subspace capturing general capabilities. Specifically, OGPSA estimates a low-rank capability subspace from gradients on a small reference set and projects the safety gradient onto its orthogonal complement before updating. This produces safety-directed updates that minimally perturb prior knowledge while retaining capacity for alignment. OGPSA is plug-and-play and integrates into standard post-training pipelines without large-scale replay, auxiliary objectives, or retraining. Across Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT→DPO settings, OGPSA consistently improves the safety–utility Pareto frontier over standard baselines. For instance, on Qwen2.5-7B-Instruct under SFT→DPO, OGPSA preserves strong safety while recovering general capability, improving SimpleQA from 0.53% to 3.03% and IFEval from 51.94% to 63.96%. Our source code is available at https://github.com/SunGL001/OGPSA.
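
The projection step described in the abstract admits a compact sketch. The following is a minimal, self-contained illustration of the core idea, not the authors' implementation: the subspace rank, the use of a single flattened gradient vector (rather than per-layer subspaces), and the plain SVD-based subspace estimate are assumptions made here for exposition.

```python
import torch

def estimate_capability_subspace(ref_grads: torch.Tensor, rank: int) -> torch.Tensor:
    """Return an orthonormal basis U of shape (d, rank) spanning the top-`rank`
    directions of the reference-set gradients (rows of ref_grads, shape (n, d))."""
    # The leading right singular vectors capture the directions along which
    # the general-capability loss is most sensitive.
    _, _, Vh = torch.linalg.svd(ref_grads, full_matrices=False)
    return Vh[:rank].T

def project_orthogonal(safety_grad: torch.Tensor, U: torch.Tensor) -> torch.Tensor:
    """Project the safety gradient onto the orthogonal complement of the
    capability subspace: g_perp = g - U (U^T g)."""
    return safety_grad - U @ (U.T @ safety_grad)

# Toy check: the projected update has no component inside the subspace.
d, n, rank = 512, 32, 8
ref_grads = torch.randn(n, d)            # stands in for reference-set gradients
U = estimate_capability_subspace(ref_grads, rank)
g = torch.randn(d)                       # stands in for a safety-batch gradient
g_perp = project_orthogonal(g, U)
print(torch.linalg.norm(U.T @ g_perp))   # ~0: orthogonal to the capability subspace
```

In an actual post-training loop, the projected gradient would be written back to the model parameters before the optimizer step, so that each safety update leaves the general-capability loss unchanged to first order.
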
Problem

Research questions and friction points this paper is trying to address.

alignment tax
continual learning
safety alignment
catastrophic forgetting
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Orthogonal Gradient Projection
Safety Alignment
Continual Learning
Alignment Tax
Pareto Frontier