🤖 AI Summary
This work addresses a core challenge in large language model (LLM) safety alignment: helpfulness and harmlessness are tightly coupled and therefore difficult to disentangle. Methodologically, we introduce a multi-level, fine-grained safety alignment framework: (1) we construct PKU-SafeRLHF, the first open-source dataset comprising 44.6k prompts and 265k question-answer pairs with dual-dimensional annotations across 19 harm categories and 3 severity levels; (2) we propose the first explicit preference annotation paradigm that decouples helpfulness from harmlessness, yielding 166.8k dual-dimensional preference labels; and (3) we integrate human annotation, severity-sensitive safety classifiers, and safety-oriented RLHF training. Our contributions include releasing the first publicly available multi-level safety preference dataset and enabling the training of risk-aware safety evaluators and safety-enhanced policies, achieving over a 32% improvement in harmful-response suppression across multiple benchmarks.
📝 Abstract
In this work, we introduce the PKU-SafeRLHF dataset, designed to promote research on safety alignment in large language models (LLMs). As a sibling project to SafeRLHF and BeaverTails, we annotate helpfulness and harmlessness separately for question-answer pairs, providing distinct perspectives on these tightly coupled attributes. Overall, we provide 44.6k refined prompts and 265k question-answer pairs with safety meta-labels covering 19 harm categories and three severity levels, ranging from minor to severe, with answers generated by Llama-family models. On this basis, we collected 166.8k preference annotations, comprising dual-preference data (helpfulness and harmlessness decoupled) and single-preference data (helpfulness and harmlessness traded off jointly). Using this large-scale annotation data, we further train severity-sensitive moderation models for risk control of LLMs and apply safety-centric RLHF algorithms for their safety alignment. We believe this dataset will be a valuable resource for the community, aiding in the safe deployment of LLMs.
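To make the dual-preference structure concrete, the sketch below shows what a single decoupled annotation record could look like in Python. The field names, example texts, and severity values here are hypothetical illustrations of the described annotation scheme, not the dataset's actual schema: the key point is that the helpfulness preference (`better_response`) and the harmlessness preference (`safer_response`) are recorded independently and may disagree.

```python
from dataclasses import dataclass, field

# Severity levels as described in the abstract; "none" marks a safe answer.
SEVERITY_LEVELS = ("none", "minor", "moderate", "severe")

@dataclass
class DualPreferenceRecord:
    """Hypothetical schema for one dual-preference annotation."""
    prompt: str
    response_a: str
    response_b: str
    harm_categories_a: list = field(default_factory=list)  # subset of the 19 harm categories
    harm_categories_b: list = field(default_factory=list)
    severity_a: str = "none"   # one of SEVERITY_LEVELS
    severity_b: str = "none"
    better_response: str = ""  # "a" or "b" -- helpfulness preference
    safer_response: str = ""   # "a" or "b" -- harmlessness preference

# Illustrative record: the more helpful answer is not the safer one,
# which is exactly the coupling the decoupled labels expose.
record = DualPreferenceRecord(
    prompt="How do I pick a lock?",
    response_a="Here is a step-by-step guide to picking pin-tumbler locks...",
    response_b="I can't help with that, but a licensed locksmith can.",
    harm_categories_a=["Privacy Violation"],  # hypothetical category name
    severity_a="moderate",
    better_response="a",  # more informative, hence preferred for helpfulness
    safer_response="b",   # refusal, hence preferred for harmlessness
)

assert record.better_response != record.safer_response
```

A single-preference record, by contrast, would carry just one label that already trades off the two dimensions, so this kind of disagreement cannot be represented there.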