All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning

📅 2025-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Despite information-theoretic arguments that reward modeling incurs inevitable information loss and that on-policy sampling cannot create new information, reinforcement learning (RL)-based fine-tuning (e.g., PPO) consistently outperforms direct supervised fine-tuning (SFT) in large language model alignment. Method: The paper examines this discrepancy through the "generation–verification gap": reward models are relatively simple verifiers that are easy to learn from preference data, albeit lossy, while the downstream RL procedure filters the policy search space to generators that are optimal for such simple verifiers. The analysis combines information-theoretic modeling, formal argument, and empirical comparison of SFT and RLHF on preference data. Contribution/Results: It offers a theoretical explanation for RL's advantage grounded in this generation–verification asymmetry, showing that performance gains arise from search-space reduction rather than information gain, and yields principled guidance for alignment training of large models.

📝 Abstract
From a first-principles perspective, it may seem odd that the strongest results in foundation model fine-tuning (FT) are achieved via a relatively complex, two-stage training procedure. Specifically, one first trains a reward model (RM) on some dataset (e.g. human preferences) before using it to provide online feedback as part of a downstream reinforcement learning (RL) procedure, rather than directly optimizing the policy parameters on the dataset via offline maximum likelihood estimation. In fact, from an information-theoretic perspective, we can only lose information via passing through a reward model and cannot create any new information via on-policy sampling. To explain this discrepancy, we scrutinize several hypotheses on the value of RL in FT through both theoretical and empirical lenses. Of the hypotheses considered, we find the most support for the explanation that on problems with a generation-verification gap, the combination of the ease of learning the relatively simple RM (verifier) from the preference data, coupled with the ability of the downstream RL procedure to then filter its search space to the subset of policies (generators) that are optimal for relatively simple verifiers is what leads to the superior performance of online FT.
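The two-stage procedure the abstract describes starts by fitting a reward model to preference data. As a minimal illustration (not the paper's implementation), the standard Bradley-Terry objective trains a linear reward model on synthetic (chosen, rejected) pairs by minimizing -log σ(r(chosen) - r(rejected)); all data and dimensions here are made up for the sketch:

```python
import numpy as np

# Toy Bradley-Terry reward model: r(x) = w . x, trained on preference
# pairs (chosen, rejected) by minimizing -log sigmoid(r(chosen) - r(rejected)).
# The data, dimensionality, and learning rate are illustrative assumptions.

rng = np.random.default_rng(0)
dim = 4
w_true = rng.normal(size=dim)          # hidden "true" preference direction

# Synthetic preference pairs: by construction, chosen has higher true reward.
pairs = []
for _ in range(200):
    a, b = rng.normal(size=dim), rng.normal(size=dim)
    pairs.append((a, b) if w_true @ a >= w_true @ b else (b, a))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(dim)                       # reward-model parameters
lr = 0.5
for _ in range(100):                    # full-batch gradient descent
    grad = np.zeros(dim)
    for chosen, rejected in pairs:
        margin = w @ (chosen - rejected)
        # Gradient of -log sigmoid(margin) w.r.t. w
        grad += (sigmoid(margin) - 1.0) * (chosen - rejected)
    w -= lr * grad / len(pairs)

# The learned reward model should rank chosen above rejected on most pairs.
acc = np.mean([float(w @ c > w @ r) for c, r in pairs])
print(f"preference accuracy: {acc:.2f}")
```

In the full pipeline, this learned reward model then supplies online feedback to an RL procedure such as PPO, rather than the policy being fit to the preference data directly by maximum likelihood.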
Problem

Research questions and friction points this paper is trying to address.

Explains why reinforcement learning outperforms direct likelihood optimization in fine-tuning.
Investigates the role of reward models in bridging generation-verification gaps.
Demonstrates the efficiency of combining simple reward models with RL for policy optimization.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage training with reward model
Reinforcement learning for fine-tuning
Optimizing policies via online feedback
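The paper's "search-space reduction" intuition, that a simple verifier lets the generator discard most candidate outputs, can be sketched with best-of-n sampling, a deliberately simplified stand-in for the full RL procedure. The verifier, candidate distribution, and sample counts below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

# Sketch of a simple verifier pruning the generation space: sample n
# candidates from a base "policy", keep the one the verifier scores
# highest (best-of-n). The scoring function and candidate distribution
# are hypothetical toy choices.

rng = np.random.default_rng(1)

def verifier_score(x):
    # Deliberately simple verifier: prefers outputs near a target value.
    return -abs(x - 3.0)

def base_policy_sample(n):
    # Stand-in generator: broad distribution over candidate outputs.
    return rng.normal(loc=0.0, scale=2.0, size=n)

def best_of_n(n):
    candidates = base_policy_sample(n)
    return candidates[np.argmax([verifier_score(c) for c in candidates])]

# Larger n means the verifier filters the search space harder,
# so the selected output scores better on average.
small = np.mean([verifier_score(best_of_n(2)) for _ in range(500)])
large = np.mean([verifier_score(best_of_n(32)) for _ in range(500)])
print(f"mean score, n=2: {small:.2f}  n=32: {large:.2f}")
```

The point of the sketch is only directional: because the verifier is much simpler to learn than the generator, cheap verification can steer generation toward the small subset of outputs (or policies) the verifier rates as optimal.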