🤖 AI Summary
To address the challenge of policy optimization under sparse, sentence-level rewards in Reinforcement Learning from Human Feedback (RLHF), this paper proposes the Reinforced Token Optimization (RTO) algorithm. RTO integrates the token-level quality signals implicitly learned by Direct Preference Optimization (DPO) with Proximal Policy Optimization (PPO), casting RLHF as a fine-grained Markov Decision Process (MDP) that combines token-level reward learning with policy optimization. Theoretically, RTO is proven to find a near-optimal policy in a sample-efficient manner. Empirically, RTO outperforms standard PPO by 7.5 points on AlpacaEval 2 and 4.1 points on Arena-Hard, and it surpasses mainstream direct preference learning approaches. This work offers an efficient, interpretable paradigm for aligning large language models under sparse-reward settings.
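To make the token-level signal concrete, below is a minimal sketch of the per-token reward that a trained DPO model implicitly defines, assuming you already have per-token log-probabilities of a sampled response under the DPO policy and a frozen reference model; the function name `token_level_rewards` and the value of `beta` are illustrative, not taken from the paper.

```python
import torch

def token_level_rewards(logp_dpo: torch.Tensor,
                        logp_ref: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Per-token reward implied by a trained DPO model.

    logp_dpo / logp_ref: log-probabilities of the sampled response
    tokens under the DPO policy and the frozen reference model,
    shape (batch, seq_len).
    """
    # DPO's implicit reward for token a_t in state s_t:
    #   r_t = beta * log( pi_dpo(a_t | s_t) / pi_ref(a_t | s_t) )
    return beta * (logp_dpo - logp_ref)

# Toy usage: two responses of five tokens each.
logp_dpo = torch.randn(2, 5)
logp_ref = torch.randn(2, 5)
dense_rewards = token_level_rewards(logp_dpo, logp_ref)
print(dense_rewards.shape)  # torch.Size([2, 5])
```

This dense reward replaces the single sentence-level score that standard RLHF pipelines assign only at the final token, which is what makes token-wise credit assignment possible in the PPO stage.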
📝 Abstract
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards -- a challenging scenario in traditional deep reinforcement learning. Despite the great successes of PPO in the alignment of large language models, its open-source implementation is still largely sub-optimal. To address these issues, we introduce a framework that models RLHF problems as a Markov decision process (MDP), enabling the capture of fine-grained token-wise information. Under this framework, we introduce an algorithm, Reinforced Token Optimization (RTO), which learns a token-wise reward function from preference data and performs policy optimization based on this learned token-wise reward signal. Theoretically, RTO is proven to find a near-optimal policy in a sample-efficient manner. For its practical implementation, RTO innovatively integrates Direct Preference Optimization (DPO) and PPO. DPO, originally derived from sparse sentence rewards, surprisingly provides a token-wise characterization of response quality, which is seamlessly incorporated into our subsequent PPO training stage. Extensive experiments demonstrate that RTO performs better than PPO and other direct preference learning algorithms. In particular, RTO outperforms PPO by 7.5 points on the AlpacaEval 2 benchmark and by 4.1 points on Arena-Hard. Our code and models are available at https://github.com/zkshan2002/RTO.
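For context on the PPO training stage, here is a hedged sketch of how dense token-level rewards can be turned into per-token advantages via generalized advantage estimation (GAE), a standard component of PPO pipelines; `gamma`, `lam`, and the helper `gae_advantages` are illustrative assumptions rather than the paper's exact implementation.

```python
import torch

def gae_advantages(rewards: torch.Tensor,
                   values: torch.Tensor,
                   gamma: float = 1.0,
                   lam: float = 0.95) -> torch.Tensor:
    """Generalized advantage estimation over token-level rewards.

    rewards, values: shape (batch, seq_len); values are the critic's
    per-token state-value estimates, with the value after the final
    token treated as zero.
    """
    batch, T = rewards.shape
    advantages = torch.zeros_like(rewards)
    gae = torch.zeros(batch)
    # Sweep backward, accumulating the exponentially weighted TD errors.
    for t in reversed(range(T)):
        next_value = values[:, t + 1] if t + 1 < T else torch.zeros(batch)
        delta = rewards[:, t] + gamma * next_value - values[:, t]
        gae = delta + gamma * lam * gae
        advantages[:, t] = gae
    return advantages

# Toy usage with dense rewards like those from the previous sketch.
rewards = torch.randn(2, 5)
values = torch.randn(2, 5)
adv = gae_advantages(rewards, values)
print(adv.shape)  # torch.Size([2, 5])
```

Because every token carries its own reward here, the TD errors are informative at each step, rather than only at the final token as in the sparse sentence-reward setting.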