🤖 AI Summary
This work addresses the long-standing dichotomy between reinforcement learning (RL)-based and RL-free methods in Reinforcement Learning from Human Feedback (RLHF). We propose a unified modeling framework, deriving for the first time the RLHF objective as a complete RL formulation and revealing the fundamental equivalence of the two paradigms under a neural structured bandit model. Building on this insight, we introduce Generalized REINFORCE Optimization (GRO), a principled framework that seamlessly integrates policy-gradient methods (e.g., PPO) with supervised alignment techniques. GRO enables flexible switching and hybrid training between RL and RL-free strategies, improving generalization and the efficiency of human preference alignment while preserving training stability. Our approach provides the first rigorous, fully RL-theoretic interpretation of RLHF and establishes a scalable, unified algorithmic paradigm grounded in formal RL foundations.
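For context, the RLHF objective referenced above is conventionally written as a KL-regularized reward maximization; this is the widely used form, and the paper's exact notation may differ:

```latex
\[
\max_{\pi}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\left[ r(x, y) \right]
\;-\;
\beta\, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right]
\]
```

Here $r$ is the learned reward model, $\pi_{\mathrm{ref}}$ is the frozen reference policy (typically the SFT model), and $\beta$ controls the strength of the KL regularization.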
📝 Abstract
In this article, we primarily examine a variety of RL-based and RL-free methods designed to address Reinforcement Learning from Human Feedback (RLHF) and Large Reasoning Models (LRMs). We begin with a concise overview of the typical steps involved in RLHF and LRMs. Next, we reinterpret several RL-based and RL-free algorithms through the lens of neural structured bandit prediction, providing a clear conceptual framework that uncovers a deeper connection between these seemingly distinct approaches. Following this, we briefly review core principles of reinforcement learning, drawing attention to an often-overlooked aspect of existing RLHF studies. This leads to a detailed derivation of the standard RLHF objective within a full RL context, demonstrating its equivalence to neural structured bandit prediction. Finally, by reinvestigating the principles behind Proximal Policy Optimization (PPO), we pinpoint areas needing adjustment, culminating in the Generalized REINFORCE Optimization (GRO) framework, which seamlessly integrates RL-based and RL-free methods in RLHF. We look forward to the community's efforts to empirically validate GRO and invite constructive feedback.
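For readers less familiar with PPO, the component the article reinvestigates, its clipped surrogate objective can be sketched as follows. This is a minimal NumPy illustration of the standard formulation, not the paper's implementation; the function name and signature are chosen here for illustration.

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate objective (to be maximized)."""
    # Importance ratio between the current and behavior policies
    ratio = np.exp(logp_new - logp_old)
    # Unclipped surrogate vs. surrogate with the ratio clipped to [1-eps, 1+eps]
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic (element-wise minimum) bound, averaged over samples
    return float(np.mean(np.minimum(unclipped, clipped)))
```

When the new and old policies coincide, the ratio is 1 everywhere and the objective reduces to the mean advantage, which is a quick sanity check for the implementation.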