Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models

📅 2024-08-27
🏛️ arXiv.org
📈 Citations: 7
Influential: 0
🤖 AI Summary
Existing RLHF-based alignment methods for LLMs suffer from high annotation costs and an inherent trade-off between safety and helpfulness, making simultaneous optimization difficult. Method: We propose Bi-Factorial Preference Optimization (BFPO), a framework that unifies safety and helpfulness preferences into a single-stage supervised learning task via a global preference labeling function. BFPO introduces a bi-factorial joint preference re-parameterization scheme to resolve the multi-objective conflict and constructs a comprehensive benchmark that jointly evaluates discriminative and generative capabilities for safety-helpfulness alignment. Results: Experiments show that BFPO consistently outperforms state-of-the-art methods in both safety and helpfulness, matching the safety of strong human-intervention baselines while cutting human annotation effort and computational cost by over 90%.

📝 Abstract
Fine-tuning large language models (LLMs) on human preferences, typically through reinforcement learning from human feedback (RLHF), has proven successful in enhancing their capabilities. However, ensuring the safety of LLMs during fine-tuning remains a critical concern, and mitigating the potential conflicts in safety and helpfulness is costly in RLHF. To address this issue, we propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO), which re-parameterizes a joint RLHF objective of both safety and helpfulness into a single supervised learning objective. In supervised optimization, a labeling function is used to capture the global preferences ranking to balance both safety and helpfulness. To evaluate BFPO, we develop a benchmark that includes comprehensive discriminative and generative tasks for helpfulness and harmlessness. The results indicate that our method significantly outperforms existing approaches in both safety and helpfulness. Moreover, BFPO achieves the same level of safety as methods that heavily rely on human labor with less than 10% of the computational resources and human prompting and annotation process. The training recipes can be found here: https://github.com/wx-zhang/bfpo.
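The abstract describes re-parameterizing the joint safety-helpfulness RLHF objective into a single supervised loss, with a labeling function providing a global preference ranking. The paper's exact objective lives in the linked repo; the following is only a schematic sketch in a DPO-style pairwise form, where `global_preference`, its `safety_weight`, and the score tuples are hypothetical stand-ins for the paper's labeling function.

```python
import math

def global_preference(help_score, safe_score, safety_weight=1.0):
    # Hypothetical labeling function: collapses helpfulness and safety
    # scores into one scalar used to rank a response pair. The actual
    # BFPO labeling function is defined in the paper and repository.
    return help_score + safety_weight * safe_score

def bfpo_style_loss(logp_a, logp_b, scores_a, scores_b, beta=0.1):
    # DPO-style pairwise supervised loss in which the "chosen" response
    # is selected by the global preference score, so a single objective
    # balances safety and helpfulness instead of helpfulness alone.
    if global_preference(*scores_a) >= global_preference(*scores_b):
        chosen, rejected = logp_a, logp_b
    else:
        chosen, rejected = logp_b, logp_a
    # -log sigmoid(beta * (logp_chosen - logp_rejected))
    margin = beta * (chosen - rejected)
    return math.log(1.0 + math.exp(-margin))

# A response that is slightly less helpful but much safer can win the
# ranking, steering the gradient toward the safer completion.
loss = bfpo_style_loss(
    logp_a=-1.2, logp_b=-0.8,
    scores_a=(0.6, 0.9),   # (helpfulness, safety) for response A
    scores_b=(0.7, 0.1),
)
```

The pairwise form above is one plausible instantiation; the key point from the abstract is that a single supervised loss, driven by a global preference label, replaces the multi-objective RLHF stage.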
Problem

Research questions and friction points this paper is trying to address.

Balancing safety and helpfulness in LLMs
Reducing computational costs in RLHF fine-tuning
Improving safety without extensive human labor
Innovation

Methods, ideas, or system contributions that make the work stand out.

BFPO, a supervised learning framework that balances safety and helpfulness
Re-parameterizes the joint RLHF objective into a single supervised objective
Uses a labeling function to capture the global preference ranking
Wenxuan Zhang
King Abdullah University of Science and Technology
Philip H. S. Torr
University of Oxford
Mohamed Elhoseiny
King Abdullah University of Science and Technology
Adel Bibi
University of Oxford
AI Safety · AI Security · Machine Learning