Continual Safety Alignment via Gradient-Based Sample Selection

📅 2026-04-18

🤖 AI Summary

This work addresses the degradation of safety alignment in large language models during continual learning: after successive fine-tuning, models become more likely to comply with adversarial requests and to generate factually incorrect content. Taking a data-centric perspective, the study finds that high-gradient samples tend to induce alignment drift, whereas moderate-gradient samples support effective task learning while largely preserving alignment. Building on this finding, the authors propose a gradient-based sample selection mechanism that filters out high-gradient samples during fine-tuning, without requiring additional safety data or architectural modifications. Experiments across multiple model families and task sequences show that the method substantially improves alignment retention while maintaining competitive task performance, and that it is robust to variations in selection ratio and task ordering.

📝 Abstract

Large language models require continuous adaptation to new tasks while preserving safety alignment. However, fine-tuning on even benign data often compromises safety behaviors, including refusal of harmful requests, truthfulness, and commonsense reasoning. We investigate which training samples cause alignment drift through a data-centric lens. Our empirical analysis shows samples contribute unequally: high-gradient samples cause greater safety degradation and drive models toward pretrained distributions, while moderate-gradient samples enable task learning with minimal alignment loss. We propose gradient-based sample selection that filters high-gradient samples during fine-tuning. Across multiple model families on continual domain tasks, our method substantially improves alignment preservation while maintaining competitive task performance, without requiring curated safe data or architectural modifications. Our method is robust across selection ratios, task orderings, and diverse attack benchmarks.
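The filtering idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how gradients are scored or what fraction is dropped, so the toy logistic model, the L2 gradient norm as the score, and the names `per_sample_grad_norm`, `select_low_gradient`, and `drop_ratio` are all assumptions made for illustration.

```python
import math


def per_sample_grad_norm(w, x, y):
    """L2 norm of the logistic-loss gradient for one (x, y) example.

    Assumption: a toy linear model with sigmoid output stands in for the LLM.
    """
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    # For binary cross-entropy with a sigmoid, d(loss)/dw = (p - y) * x.
    return abs(p - y) * math.sqrt(sum(xi * xi for xi in x))


def select_low_gradient(samples, w, drop_ratio=0.2):
    """Return `samples` with the highest-gradient `drop_ratio` fraction removed.

    `drop_ratio` is a hypothetical hyperparameter, not a value from the paper.
    """
    scored = [(per_sample_grad_norm(w, x, y), i) for i, (x, y) in enumerate(samples)]
    n_drop = int(len(samples) * drop_ratio)
    # Indices of the n_drop samples with the largest gradient norms.
    dropped = {i for _, i in sorted(scored, reverse=True)[:n_drop]}
    return [s for i, s in enumerate(samples) if i not in dropped]


# Toy data: the third sample has a large norm and the "wrong" label for an
# untrained model, so it yields the largest gradient and is filtered out.
samples = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([5.0, 5.0], 0),
           ([0.5, 0.5], 1), ([1.0, 1.0], 1)]
w = [0.0, 0.0]
kept = select_low_gradient(samples, w, drop_ratio=0.2)
```

In a real fine-tuning setup the score would presumably be computed from the model's loss gradient on each training example, and the retained subset would then be used for the task update; those details are left open here because the source does not state them.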

Problem

Research questions and friction points this paper is trying to address.

continual learning

safety alignment

large language models

alignment drift

fine-tuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

gradient-based sample selection

continual safety alignment

alignment drift