Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization

📅 2024-10-25
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the "safety compensation" problem in LLM safety alignment, where expected safety constraints are satisfied on average while individual prompts trade off safety: some responses become overly conservative and others remain unsafe. The authors propose Rectified Policy Optimization (RePO), which replaces the global expected safety constraint with critical safety constraints imposed on every prompt. At its core is a policy update mechanism driven by rectified policy gradients that penalizes each prompt's strict safety violation, enabling prompt-level safety control within the RLHF pipeline. Experiments show that RePO outperforms strong baselines, enhancing safety across nearly all prompts while preserving helpfulness.

📝 Abstract
Balancing helpfulness and safety (harmlessness) is a critical challenge in aligning large language models (LLMs). Current approaches often decouple these two objectives, training separate preference models for helpfulness and safety, while framing safety as a constraint within a constrained Markov Decision Process (CMDP) framework. This paper identifies a potential issue when using the widely adopted expected safety constraints for LLM safety alignment, termed "safety compensation", where the constraints are satisfied on expectation, but individual prompts may trade off safety, resulting in some responses being overly restrictive while others remain unsafe. To address this issue, we propose Rectified Policy Optimization (RePO), which replaces the expected safety constraint with critical safety constraints imposed on every prompt. At the core of RePO is a policy update mechanism driven by rectified policy gradients, which penalizes the strict safety violation of every prompt, thereby enhancing safety across nearly all prompts. Our experiments demonstrate that RePO outperforms strong baseline methods and significantly enhances LLM safety alignment.
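The contrast the abstract draws can be made concrete with a small sketch. This is not the paper's exact objective; it is a simplified illustration (function name, inputs, and the fixed penalty weight `lam` are all assumptions) of how an expected-value constraint can be satisfied while individual prompts still violate the safety budget, and how a rectified per-prompt penalty removes that averaging effect:

```python
import numpy as np

def rectified_penalty_loss(rewards, costs, log_probs, budget, lam):
    """Toy contrast between an expected safety constraint and a
    per-prompt rectified penalty (in the spirit of RePO)."""
    rewards = np.asarray(rewards, dtype=float)
    costs = np.asarray(costs, dtype=float)
    log_probs = np.asarray(log_probs, dtype=float)

    # Expected-constraint penalty (standard CMDP-style Lagrangian term):
    # a single global term, so unsafe prompts can be offset by overly
    # safe ones -- the "safety compensation" failure mode.
    expected_pen = lam * max(costs.mean() - budget, 0.0)

    # Rectified per-prompt penalty: each prompt's strict violation is
    # penalized individually, with no cross-prompt averaging.
    per_prompt_pen = lam * np.maximum(costs - budget, 0.0)

    # REINFORCE-style surrogate: maximize reward minus per-prompt penalty.
    objective = rewards - per_prompt_pen
    loss = -np.mean(log_probs * objective)
    return loss, expected_pen, per_prompt_pen
```

With costs `[0.0, 2.0]` and a budget of `1.0`, the expected penalty is zero (the mean cost exactly meets the budget), yet the second prompt's rectified penalty is nonzero, so the gradient still pushes the policy toward safety on that prompt.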
Problem

Research questions and friction points this paper is trying to address.

Balancing helpfulness and safety in LLMs
Addressing safety compensation issue in LLMs
Proposing Rectified Policy Optimization for LLM safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rectified Policy Optimization
Critical safety constraints
Penalizes safety violations
Xiyue Peng
ShanghaiTech University
Hengquan Guo
ShanghaiTech University
Jiawei Zhang
SenseTime Research
Dongqing Zou
SenseTime Research
Ziyu Shao
ShanghaiTech University
Honghao Wei
Assistant Professor of EECS, Washington State University
Reinforcement Learning · Optimization · Safe-RL
Xin Liu
ShanghaiTech University