🤖 AI Summary
Balancing safety and helpfulness in large language models (LLMs) remains challenging: existing approaches either sacrifice utility for safety or lack the ability to trade the two off adaptively. This paper proposes a Mixture-of-Experts (MoE)-based dual-preference optimization framework built on the paradigm of "safety-helpfulness decoupled training + joint inference": a safety expert and a helpfulness expert are trained separately via single-preference-enhanced Direct Preference Optimization (DPO) so that the two objectives stay functionally decoupled, and a dynamic gating mechanism then adapts the experts' weights at inference time to fuse their outputs contextually. Evaluated on three benchmark datasets, the method outperforms state-of-the-art approaches, improving safety and helpfulness metrics simultaneously rather than trading one for the other.
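The summary above mentions training each expert with single-preference-enhanced DPO, i.e. each expert sees preference pairs ranked by only one objective (safety pairs for the safety expert, helpfulness pairs for the helpfulness expert). The paper's exact loss is not given here, but the standard DPO objective each expert would optimize can be sketched in plain Python; the function name and scalar log-probability inputs are illustrative assumptions:

```python
import math

def dpo_loss(pol_chosen_logp, pol_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss on one (chosen, rejected) preference pair.

    Illustrative sketch only: in a single-preference setup, the
    safety expert is trained with pairs ranked by safety alone and
    the helpfulness expert with pairs ranked by helpfulness alone,
    so each expert's gradient reflects exactly one objective.
    """
    # Implicit reward = beta * log-ratio of policy vs. frozen reference
    margin = (pol_chosen_logp - ref_chosen_logp) \
           - (pol_rejected_logp - ref_rejected_logp)
    # Bradley-Terry objective: push the chosen response above the rejected one
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

As the policy's margin over the reference grows in favor of the chosen response, the loss decreases, which is what drives each expert toward its single objective.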
📝 Abstract
As large language models (LLMs) are increasingly applied across various domains, enhancing safety while maintaining the helpfulness of LLMs has become a critical challenge. Recent studies address this problem through safety-constrained online preference optimization or safety-constrained offline preference optimization. However, the safety-constrained online methods often suffer from excessive safety, which can reduce helpfulness, while the safety-constrained offline methods perform poorly at adaptively balancing safety and helpfulness. To address these limitations, we propose MidPO, a **Mi**xture-of-Experts (MoE) framework for safety-helpfulness **d**ual **P**reference **O**ptimization. First, MidPO devises a single-preference-enhanced direct preference optimization (DPO) approach that transforms the base model into two independent experts, termed the safety and helpfulness experts, and fine-tunes each expert for optimal safety or helpfulness performance. Second, to achieve an effective balance between safety and helpfulness, MidPO incorporates the two experts into the MoE framework and designs a dynamic routing mechanism that adaptively allocates the contribution of each expert. We conduct quantitative and qualitative experiments on three popular datasets to demonstrate that MidPO significantly outperforms state-of-the-art approaches in both safety and helpfulness. The code and models will be released.
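The abstract's second step, a dynamic routing mechanism that adaptively weights the two experts, can be sketched as a small gating function over per-input hidden features. This is a minimal illustration under stated assumptions: the `gate` scoring function, the list-of-floats representation, and the expert callables are all hypothetical stand-ins, since the abstract does not specify the routing network's form:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_experts(hidden, safety_expert, helpful_expert, gate):
    """Fuse two expert outputs with input-dependent gate weights.

    Hypothetical sketch of MoE-style dynamic routing: `gate(hidden)`
    returns two unnormalized scores, softmax turns them into
    contribution weights, and the expert outputs are mixed linearly.
    """
    w_safe, w_help = softmax(gate(hidden))
    out_safe = safety_expert(hidden)
    out_help = helpful_expert(hidden)
    # Per-dimension convex combination of the two expert outputs
    return [w_safe * s + w_help * h for s, h in zip(out_safe, out_help)]
```

Because the weights depend on `hidden`, a risky prompt can shift mass toward the safety expert while a benign one favors the helpfulness expert, which is the adaptive balance the abstract describes.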