UNA: Unifying Alignments of RLHF/PPO, DPO and KTO by a Generalized Implicit Reward Function

📅 2024-08-27
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing alignment methods for large language models—such as RLHF, DPO, and KTO—suffer from distinct limitations: RLHF entails complex reinforcement learning pipelines; DPO requires costly pairwise preference data; and KTO exhibits constrained generalization. This paper proposes UNA, the first framework unifying these paradigms via a generalized implicit reward function, recasting alignment as a supervised learning problem compatible with heterogeneous feedback—including pairwise, binary, and scalar signals. Theoretically, we rigorously prove that the optimal policy is induced by this implicit reward. Methodologically, UNA introduces a policy-reward mapping and a reward-difference minimization objective, drastically simplifying training. Empirically, UNA outperforms strong baselines across multiple benchmarks, achieving faster convergence, reduced memory footprint, and enhanced training stability.

📝 Abstract
An LLM is pretrained on trillions of tokens, but the pretrained LLM may still generate undesired responses. To solve this problem, alignment techniques such as RLHF, DPO and KTO have been proposed. However, these alignment techniques have limitations. For example, RLHF requires training the reward model and the policy separately, which is complex, time-consuming, memory-intensive and unstable during training. DPO proposes a mapping between an optimal policy and a reward, greatly simplifying the training process of RLHF; however, it cannot take full advantage of a reward model and is limited to pairwise preference data. In this paper, we propose UNified Alignment (UNA), which unifies RLHF/PPO, DPO and KTO. First, we mathematically prove that, given the classical RLHF objective, the optimal policy is induced by a generalized implicit reward function. With this novel mapping between a reward model and an optimal policy, UNA can 1. unify RLHF/PPO, DPO and KTO into a supervised learning problem of minimizing the difference between an implicit reward and an explicit reward; 2. outperform RLHF/PPO while simplifying, stabilizing, speeding up and reducing the memory burden of the RL fine-tuning process; 3. accommodate different feedback types, including pairwise, binary and scalar feedback. Downstream experiments show that UNA outperforms DPO, KTO and RLHF.
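The supervised objective described in the abstract — minimizing the difference between an implicit reward and an explicit reward — can be sketched as follows. This is a minimal illustration, assuming the DPO-style implicit reward form r(x, y) = β · (log π_θ(y|x) − log π_ref(y|x)); the function names and the plain squared-error loss are illustrative choices, not the paper's exact implementation.

```python
def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    # DPO-style implicit reward: beta * log(pi_theta(y|x) / pi_ref(y|x)),
    # computed from the summed log-probabilities of response y under each model.
    return beta * (logp_policy - logp_ref)


def una_reward_difference_loss(logp_policy: float, logp_ref: float,
                               explicit_reward: float, beta: float = 0.1) -> float:
    # Supervised objective (sketch): squared difference between the implicit
    # reward induced by the policy and an explicit reward signal.
    return (implicit_reward(logp_policy, logp_ref, beta) - explicit_reward) ** 2
```

When the policy's implicit reward matches the explicit reward exactly, the loss is zero, which is what makes the alignment problem trainable as plain supervised regression rather than RL.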
Problem

Research questions and friction points this paper is trying to address.

How can RLHF/PPO, DPO, and KTO be unified under a single alignment objective?
How can the RL fine-tuning process be simplified and stabilized?
How can diverse feedback types (pairwise, binary, scalar) be accommodated?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies RLHF/PPO, DPO, and KTO via generalized implicit reward
Simplifies RL fine-tuning, enhancing speed and stability
Supports diverse feedback types: pairwise, binary, scalar
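The third innovation — one objective serving pairwise, binary, and scalar feedback — can be sketched by pairing the same implicit reward with a loss matched to each feedback type. This is an assumption-laden sketch: the DPO-style implicit reward form and the specific losses (log-sigmoid margin for pairs, cross-entropy for binary labels, squared error for scalars) are illustrative stand-ins, not the paper's exact formulation.

```python
import math


def _implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    # beta * log(pi_theta(y|x) / pi_ref(y|x))
    return beta * (logp_policy - logp_ref)


def pairwise_loss(chosen: tuple, rejected: tuple, beta: float = 0.1) -> float:
    # chosen/rejected are (logp_policy, logp_ref) pairs for the two responses;
    # DPO-like negative log-sigmoid of the implicit-reward margin.
    margin = _implicit_reward(*chosen, beta) - _implicit_reward(*rejected, beta)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


def binary_loss(logp_policy: float, logp_ref: float, label: int, beta: float = 0.1) -> float:
    # label in {0, 1} (desirable / undesirable, as in KTO-style binary feedback):
    # cross-entropy between sigmoid(implicit reward) and the label.
    p = 1.0 / (1.0 + math.exp(-_implicit_reward(logp_policy, logp_ref, beta)))
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def scalar_loss(logp_policy: float, logp_ref: float, reward: float, beta: float = 0.1) -> float:
    # Scalar feedback (e.g. from an explicit reward model): squared error
    # between the implicit and explicit reward.
    return (_implicit_reward(logp_policy, logp_ref, beta) - reward) ** 2
```

All three losses drive the same quantity, the implicit reward, which is what lets heterogeneous feedback share one training loop.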
Zhichao Wang
Salesforce
Bin Bi
Salesforce
Can Huang
School of Mathematical Sciences, Xiamen University
Shiva K. Pentyala
Salesforce
Zixu Zhu
Salesforce
S. Asur
Salesforce
Na Cheng
Dalian University of Technology