🤖 AI Summary
This work addresses the limitations of existing safety alignment methods for language models, which rely either on complex training pipelines or on heuristic strategies that lack a theoretical foundation. The authors reformulate safety alignment as a density ratio matching problem and propose Bregman Safety Optimization (BSO), a single-stage optimization framework based on Bregman divergences. Each BSO loss is constructed from a convex generating function, which enables safe policy learning within the direct preference optimization paradigm without auxiliary models. The approach introduces only one hyperparameter beyond standard preference optimization and recovers several existing safety-aware alignment methods as special cases, combining theoretical guarantees with a simple implementation. Experiments across multiple safety alignment benchmarks show that BSO consistently improves the trade-off between safety and helpfulness.
📝 Abstract
Aligning language models for both helpfulness and safety typically requires complex pipelines: separate reward and cost models, online reinforcement learning, and primal-dual updates. Recent direct preference optimization approaches simplify training but incorporate safety through ad-hoc modifications, such as multi-stage procedures or heuristic margin terms, that lack a principled derivation. We show that the likelihood ratio of the optimal safe policy admits a closed-form decomposition that reduces safety alignment to a density ratio matching problem. Minimizing Bregman divergences between the data and model ratios yields Bregman Safety Optimization (BSO), a family of single-stage loss functions, each induced by a convex generator, that provably recover the optimal safe policy. BSO is both general and simple: it requires no auxiliary models, introduces only one hyperparameter beyond standard preference optimization, and recovers existing safety-aware methods as special cases. Experiments across safety alignment benchmarks show that BSO consistently improves the safety-helpfulness trade-off.
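To make the phrase "Bregman divergences between the data and model ratios" concrete, the following is a minimal sketch of the generic pointwise Bregman divergence induced by a convex generator f, namely D_f(r, r_hat) = f(r) - f(r_hat) - f'(r_hat)(r - r_hat). It is not the BSO loss from the paper: the generator names, the function names, and the toy ratio values are all assumptions chosen for illustration, and the abstract does not say which generators the authors use.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

# Each generator is a pair (f, f') with f convex on the positive reals.
# These three are illustrative choices, not taken from the paper.
Generator = Tuple[Callable[[float], float], Callable[[float], float]]

GENERATORS: Dict[str, Generator] = {
    # f(t) = t^2 / 2 gives half the squared error between the ratios.
    "squared": (lambda t: 0.5 * t * t, lambda t: t),
    # f(t) = t log t - t gives the generalized KL divergence.
    "kl": (lambda t: t * math.log(t) - t, lambda t: math.log(t)),
    # f(t) = t log t - (1 + t) log(1 + t), a logistic-type generator.
    "logistic": (
        lambda t: t * math.log(t) - (1.0 + t) * math.log(1.0 + t),
        lambda t: math.log(t) - math.log(1.0 + t),
    ),
}


def bregman(gen: Generator, r: float, r_hat: float) -> float:
    """Pointwise Bregman divergence between a target ratio r and a model ratio r_hat."""
    f, df = gen
    return f(r) - f(r_hat) - df(r_hat) * (r - r_hat)


def mean_bregman(gen: Generator, targets: Sequence[float], preds: Sequence[float]) -> float:
    """Average divergence over paired (target, model) ratios; all values must be positive."""
    return sum(bregman(gen, r, rh) for r, rh in zip(targets, preds)) / len(targets)


if __name__ == "__main__":
    target_ratios = [0.5, 1.0, 2.0]   # hypothetical "data" ratios
    model_ratios = [0.6, 0.9, 1.5]    # hypothetical "model" ratios
    for name, gen in GENERATORS.items():
        print(name, mean_bregman(gen, target_ratios, model_ratios))
```

Every generator here yields a divergence that is non-negative and equals zero exactly when the model ratio matches the target ratio, which is the property that lets a ratio-matching objective of this kind be minimized at the true ratio regardless of which convex generator is chosen.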