PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference

📅 2024-06-20
📈 Citations: 50 · Influential: 3
🤖 AI Summary
This work addresses a core challenge in large language model (LLM) safety alignment: *helpfulness* and *harmlessness* are tightly coupled and therefore hard to disentangle. Methodologically, we introduce a multi-level, fine-grained safety alignment framework: (1) we construct PKU-SafeRLHF, the first open-source dataset comprising 44.6k prompts and 265k question-answer pairs with dual-dimensional annotations across 19 harm categories and 3 severity levels; (2) we propose an explicit preference annotation paradigm that decouples helpfulness from harmlessness, yielding 166.8k dual-dimensional preference labels; and (3) we integrate human annotation, severity-sensitive safety classifiers, and safety-oriented RLHF training into one pipeline. Our contributions include releasing a publicly available multi-level safety preference dataset and enabling the training of risk-aware safety evaluators and safety-enhanced policies, achieving over 32% improvement in harmful-response suppression across multiple benchmarks.
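The dual-dimensional annotation scheme above is easiest to picture as a single comparison record. Below is a minimal sketch of one such record; the field names are hypothetical, inferred from the summary rather than taken from the released files, so consult the dataset itself for the actual schema.

```python
# Hypothetical structure of one annotated comparison in a PKU-SafeRLHF-style
# dataset. Field names are illustrative, not the dataset's actual schema.
record = {
    "prompt": "How do I ...?",
    "response_0": "...",                       # answers generated by Llama-family models
    "response_1": "...",
    # Safety meta-labels per answer: 19 harm categories, 3 severity levels.
    "response_0_harm_categories": ["privacy_violation"],
    "response_0_severity": "minor",            # one of {"minor", "moderate", "severe"}
    "response_1_harm_categories": [],          # empty list: answer judged harmless
    "response_1_severity": None,
    # Decoupled dual-dimensional preference labels (the key design choice):
    "better_response_id": 0,                   # which answer is more helpful
    "safer_response_id": 1,                    # which answer is more harmless
}
```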

📝 Abstract
In this work, we introduce the PKU-SafeRLHF dataset, designed to promote research on safety alignment in large language models (LLMs). As a sibling project to SafeRLHF and BeaverTails, we separate annotations of helpfulness and harmlessness for question-answering pairs, providing distinct perspectives on these coupled attributes. Overall, we provide 44.6k refined prompts and 265k question-answer pairs with safety meta-labels for 19 harm categories and three severity levels ranging from minor to severe, with answers generated by Llama-family models. On this basis, we collect 166.8k preference annotations, comprising dual-preference data (helpfulness and harmlessness decoupled) and single-preference data (helpfulness and harmlessness traded off jointly). Using this large-scale annotation data, we further train a severity-sensitive moderation model for the risk control of LLMs and safety-centric RLHF algorithms for the safety alignment of LLMs. We believe this dataset will be a valuable resource for the community, aiding in the safe deployment of LLMs.
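For readers who want to inspect the data directly, here is a minimal loading sketch using the Hugging Face `datasets` library; the repository id `PKU-Alignment/PKU-SafeRLHF` is assumed from the project's naming and should be verified on the Hub.

```python
# Minimal sketch: browse the dataset with Hugging Face `datasets`.
# The repository id is an assumption based on the project name; verify on the Hub.
from datasets import load_dataset

ds = load_dataset("PKU-Alignment/PKU-SafeRLHF", split="train")
print(ds.column_names)  # inspect the actual schema rather than guessing it
print(ds[0])            # one annotated question-answer comparison
```
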
Problem

Research questions and friction points this paper is trying to address.

Develop a safety human preference dataset for LLM alignment
Decouple helpfulness and harmlessness in QA pair annotations
Train safety-centric RLHF algorithms for risk control (formalized in the sketch after this list)
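The last two points admit a standard formalization. With decoupled labels, one can fit two separate Bradley-Terry preference models, a reward model for helpfulness and a cost model for harmlessness, and then optimize the policy under a cost budget. The sketch below follows the general Safe RLHF / constrained-RLHF literature rather than this paper's exact notation; the symbols $R_\phi$, $C_\psi$, the budget $d$, and the pairing variables are illustrative.

```latex
% Decoupled Bradley-Terry losses: y_w / y_l are the more / less helpful answers,
% y_s / y_u the safer / less safe answers, each under its own preference label.
\mathcal{L}_R(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[ \log \sigma\!\left( R_\phi(x, y_w) - R_\phi(x, y_l) \right) \right]
\qquad
\mathcal{L}_C(\psi) = -\,\mathbb{E}_{(x,\,y_s,\,y_u)}
    \left[ \log \sigma\!\left( C_\psi(x, y_u) - C_\psi(x, y_s) \right) \right]

% Safety-centric policy optimization: maximize reward subject to a cost budget d.
\max_{\theta}\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ R_\phi(x, y) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ C_\psi(x, y) \right] \le d
```

The sibling SafeRLHF project solves such constrained problems with a Lagrangian relaxation, alternating updates of the policy parameters and the multiplier.
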
Innovation

Methods, ideas, or system contributions that make the work stand out.

Safety human preference dataset for LLMs
Separate annotations for helpfulness and harmlessness
Severity-sensitive moderation and safety-centric RLHF (see the classifier sketch below)
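To make the moderation idea concrete, here is a minimal sketch of a severity-sensitive moderation classifier: a multi-label head over the 19 harm categories plus a severity head over the 3 levels, on top of a pretrained encoder. The architecture and backbone (`roberta-base`) are assumptions for illustration, not the authors' released model.

```python
# Sketch of a severity-sensitive moderation model (hypothetical architecture,
# not the authors' released classifier).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SeverityAwareModerator(nn.Module):
    def __init__(self, backbone: str = "roberta-base",
                 num_categories: int = 19, num_severities: int = 3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size
        # Multi-label head: which of the 19 harm categories apply (sigmoid per class).
        self.category_head = nn.Linear(hidden, num_categories)
        # Single-label head: minor / moderate / severe (softmax over 3 levels).
        self.severity_head = nn.Linear(hidden, num_severities)

    def forward(self, input_ids, attention_mask):
        # Use the first-token embedding as a pooled representation of the QA pair.
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.category_head(h), self.severity_head(h)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
batch = tokenizer(["Q: How do I ...? A: ..."], return_tensors="pt", truncation=True)
with torch.no_grad():
    cat_logits, sev_logits = SeverityAwareModerator()(
        batch["input_ids"], batch["attention_mask"])
```

Trained on the severity meta-labels, such a classifier can gate responses by risk level rather than with a single binary harmful/harmless decision.
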
👥 Authors
Jiaming Ji (Peking University)
Donghai Hong (Peking University)
Borong Zhang (University of Macau)
Boyuan Chen (Peking University)
Josef Dai (Zhejiang University)
Boren Zheng (Peking University)
Tianyi Qiu (Peking University)
Boxun Li (Infinigence-AI)
Yaodong Yang (Peking University)