Cat-DPO: Category-Adaptive Safety Alignment

📅 2026-04-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing safety alignment methods often reduce safety to a single scalar metric, which makes it hard to balance performance across diverse harm categories. This work proposes Cat-DPO, an algorithm that formulates safety alignment as a constrained optimization problem partitioned by harm category. Built on Direct Preference Optimization (DPO), it introduces a category-adaptive safety margin that lets the training signal dynamically track the current difficulty of each harm category. Evaluated on two large language models against six baseline methods, the approach improves both overall helpfulness and harmlessness while reducing performance disparities across harm categories and narrowing the gap between best- and worst-case safety outcomes.
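To make "performance disparities across harm categories" and the "best- and worst-case gap" concrete, here is a minimal sketch of how such statistics could be computed from per-category safe-response rates. The function name `safety_disparity`, the category names, and the numbers are illustrative assumptions, not values or definitions from the paper.

```python
from statistics import pvariance


def safety_disparity(safe_rate_by_category):
    """Summarize per-category safety rates (each a fraction in [0, 1]).

    Hypothetical helper: the paper's exact metric definitions may differ.
    """
    rates = list(safe_rate_by_category.values())
    return {
        # Average safety, the single scalar most methods report
        "mean": sum(rates) / len(rates),
        # Spread of safety across categories (population variance)
        "variance": pvariance(rates),
        # Distance between the safest and least safe category
        "best_worst_gap": max(rates) - min(rates),
    }


# Made-up rates for three hypothetical harm categories
summary = safety_disparity({"violence": 0.9, "fraud": 0.7, "privacy": 0.8})
```

Two models can share the same `mean` while differing sharply in `variance` and `best_worst_gap`, which is the failure mode a single scalar hides.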

📝 Abstract

Aligning large language models with human preferences must balance two competing goals: responding helpfully to legitimate requests and reliably refusing harmful ones. Most preference-based safety alignment methods collapse safety into a single scalar that is applied uniformly to every preference pair. The result is a model that looks safe on average but remains relatively unsafe on a minority of harm categories. We cast safety alignment as a per-category constrained optimization problem and derive Cat-DPO, a direct-preference-optimization algorithm with a separate adaptive safety margin for each harm category. The margin tightens while the model still produces unsafe responses on a category and relaxes once the model catches up, so the training signal tracks each category's current difficulty rather than being averaged under one global rate. Across two LLM backbones and six preference-learning baselines, Cat-DPO improves aggregate helpfulness and harmlessness and compresses both per-category safety variance and the best-to-worst gap, offering a drop-in per-category refinement of direct preference safety alignment.
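The abstract does not give the loss or the margin update rule, so the following is only a rough sketch of the general idea, not the authors' method. It assumes a margin-augmented DPO loss, where the margin is subtracted inside the sigmoid, and a dual-ascent-style update that raises a category's margin while its unsafe rate exceeds a target. The names `CategoryMargins`, `target_unsafe_rate`, `step`, `max_margin`, and `beta`, and all default values, are hypothetical.

```python
import math
from collections import defaultdict


def dpo_margin_loss(logratio_chosen, logratio_rejected, margin, beta=0.1):
    """Loss for one preference pair: -log sigmoid(beta * (chosen - rejected) - margin).

    Each logratio is log pi_theta(y|x) - log pi_ref(y|x) for that response.
    """
    z = beta * (logratio_chosen - logratio_rejected) - margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z))
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


class CategoryMargins:
    """Hypothetical tracker holding one adaptive margin per harm category."""

    def __init__(self, target_unsafe_rate=0.05, step=0.5, max_margin=5.0):
        self.target = target_unsafe_rate
        self.step = step
        self.max_margin = max_margin
        self.margins = defaultdict(float)  # every category starts at margin 0

    def get(self, category):
        return self.margins[category]

    def update(self, category, unsafe_rate):
        """Tighten the margin if the category is above target, relax it otherwise."""
        new = self.margins[category] + self.step * (unsafe_rate - self.target)
        # Keep the margin within [0, max_margin]
        self.margins[category] = min(max(new, 0.0), self.max_margin)
        return self.margins[category]
```

In a training loop under these assumptions, each preference pair would look up `margins.get(category)` before computing `dpo_margin_loss`, and `update` would be called periodically with a fresh estimate of that category's unsafe-response rate. A larger margin increases the loss for the same log-ratio gap, so categories that lag behind receive a stronger training signal.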

Problem

Research questions and friction points this paper is trying to address.

safety alignment

harm categories

preference optimization

large language models

category-adaptive

Innovation

Methods, ideas, or system contributions that make the work stand out.

Category-Adaptive Safety

Direct Preference Optimization

Per-Category Constrained Optimization