Curriculum Learning for Safety Alignment

Date: 2026-05-25
Citations: 0
Influential citations: 0
AI Summary
This work addresses the brittleness of Direct Preference Optimization (DPO) in safety alignment and its poor out-of-distribution (OOD) generalization. The authors propose Staged-Competence, a framework that brings curriculum learning to safety alignment: it organizes preference data by difficulty, samples according to the model's current competence, and progressively updates the reference model during training. The approach is agnostic to the policy optimization loss and can extend to other DPO variants; it matches baseline safety performance with only 75% of the training data and yields better separation between safe and unsafe responses. Averaged across three model families, it reduces OOD harmful response rates by 16% and jailbreak attack success rates by 20%, while preserving general capabilities with near-zero over-refusal.
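For readers unfamiliar with the objective being made more robust, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. The function name, argument names, and the default `beta` are illustrative assumptions; the returned margin is only one plausible way to quantify how well a model separates a safe (chosen) response from an unsafe (rejected) one, and is not necessarily the measure used by the authors.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are summed token log-probabilities of the chosen (safe) and
    rejected (unsafe) responses under the policy and the frozen reference.
    beta=0.1 is an assumed default, not a value taken from the paper.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response than the reference does, scaled by beta.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # loss = -log(sigmoid(margin)) = softplus(-margin), computed stably.
    if margin >= 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return loss, margin


# Hypothetical log-probabilities, chosen only to illustrate the call.
loss, margin = dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=0.5)
print(f"loss={loss:.4f} margin={margin:.2f}")
```

A larger positive margin means the policy has moved further than the reference toward the safe response, and the loss shrinks accordingly.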
Abstract
Direct Preference Optimisation (DPO) is widely used for safety alignment in large language models. However, prior work shows it is brittle and exhibits poor out-of-distribution (OOD) generalisation. In this paper, we investigate whether Curriculum Learning can improve the robustness of DPO-based safety alignment. We propose Staged-Competence, a curriculum-based framework that organises preference data by difficulty, employs competence-based sampling, and progressively updates the reference model during training. Averaged across three model families, Staged-Competence reduces OOD harmful response rates by 16% and jailbreak attack success rates by 20%, while preserving general capabilities with near-zero over-refusal. We further show that Staged-Competence (1) matches baseline safety with only 75% of the training data and (2) yields better separation between safe and unsafe responses. Staged-Competence is agnostic to the policy optimisation loss and can extend to other DPO variants and alignment domains. Our code and data are available at https://github.com/Sandeep5500/curriculum-learning-for-safety.
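The three ingredients named in the abstract (difficulty ordering, competence-based sampling, and progressive reference-model updates) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' implementation: the difficulty scores are made up, the square-root competence schedule is borrowed from earlier competence-based curriculum work, and the helper names (`competence`, `sample_batch`, `reference_refresh_steps`) and stage count are hypothetical. The linked repository is the place to look for the actual method.

```python
import math
import random


def competence(step, total_steps, c0=0.1):
    """Square-root competence schedule (an assumed choice, not the paper's).

    Returns the fraction of the difficulty-sorted data the model may see,
    growing from c0 at step 0 to 1.0 at total_steps.
    """
    frac = min(1.0, step / total_steps)
    return min(1.0, math.sqrt(frac * (1.0 - c0 ** 2) + c0 ** 2))


def sample_batch(sorted_pairs, step, total_steps, batch_size, rng):
    """Sample uniformly from the easiest competence-fraction of the pairs."""
    c = competence(step, total_steps)
    cutoff = max(batch_size, math.ceil(c * len(sorted_pairs)))
    pool = sorted_pairs[:cutoff]
    return rng.sample(pool, min(batch_size, len(pool)))


def reference_refresh_steps(total_steps, num_stages):
    """Steps at which a training loop would copy the policy into the reference."""
    return [round(k * total_steps / num_stages) for k in range(1, num_stages)]


# Hypothetical preference pairs with a made-up difficulty score in [0, 1].
pairs = [{"id": i, "difficulty": i / 99} for i in range(100)]
sorted_pairs = sorted(pairs, key=lambda p: p["difficulty"])

rng = random.Random(0)
total_steps = 9
refresh_at = set(reference_refresh_steps(total_steps, num_stages=3))

for step in range(total_steps + 1):
    batch = sample_batch(sorted_pairs, step, total_steps, batch_size=4, rng=rng)
    hardest = max(p["difficulty"] for p in batch)
    note = " (refresh reference model here)" if step in refresh_at else ""
    print(f"step {step}: competence={competence(step, total_steps):.2f}, "
          f"hardest sampled difficulty={hardest:.2f}{note}")
```

In a real training loop each sampled batch would feed the chosen preference loss, and the refresh step would replace the frozen reference weights with a copy of the current policy.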
Problem

Research questions and friction points this paper is trying to address.

Safety Alignment
Direct Preference Optimisation
Out-of-Distribution Generalisation
Robustness
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curriculum Learning
Direct Preference Optimisation
Safety Alignment
Out-of-Distribution Generalisation
Competence-based Sampling