$Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing policy-based post-training methods (e.g., PPO, DPO) can fail to fix the “shortcuts” an LLM inherits from pre-training, which undermines both alignment and reasoning. To address this, we propose Q♯, a value-based algorithm for KL-regularized RL that guides the reference policy with the optimal regularized Q function, learned via distributional RL on an aggregated online dataset. On the theory side, we reduce KL-regularized RL to no-regret online learning, giving the first bounds for deterministic MDPs under only a realizability assumption; the bounds are also variance-dependent, so convergence is faster when the reference policy has small variance. Experiments show that Q♯ outperforms prior baselines on mathematical reasoning benchmarks while maintaining a smaller KL divergence to the reference policy. The implementation is publicly available.

📝 Abstract
Reinforcement learning (RL) post-training is crucial for LLM alignment and reasoning, but existing policy-based methods, such as PPO and DPO, can fall short of fixing shortcuts inherited from pre-training. In this work, we introduce $Q\sharp$, a value-based algorithm for KL-regularized RL that guides the reference policy using the optimal regularized $Q$ function. We propose to learn the optimal $Q$ function using distributional RL on an aggregated online dataset. Unlike prior value-based baselines that guide the model using unregularized $Q$-values, our method is theoretically principled and provably learns the optimal policy for the KL-regularized RL problem. Empirically, $Q\sharp$ outperforms prior baselines in math reasoning benchmarks while maintaining a smaller KL divergence to the reference policy. Theoretically, we establish a reduction from KL-regularized RL to no-regret online learning, providing the first bounds for deterministic MDPs under only realizability. Thanks to distributional RL, our bounds are also variance-dependent and converge faster when the reference policy has small variance. In sum, our results highlight $Q\sharp$ as an effective approach for post-training LLMs, offering both improved performance and theoretical guarantees. The code can be found at https://github.com/jinpz/q_sharp.
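
For readers who want the abstract's central object written out, the block below states the usual form of the KL-regularized RL objective and its closed-form optimal policy. The notation is an assumption made for illustration and does not come from this page: $\eta > 0$ is the KL weight, $\pi^{\mathrm{ref}}$ the reference policy, $x_h$ and $y_h$ the state and action at step $h$, and $H$ the horizon. The last line holds when transitions are deterministic. It shows that the optimal regularized $Q$ function is a functional of the distribution of cumulative reward under $\pi^{\mathrm{ref}}$, which is plausibly why a distributional estimate is a natural fit.

```latex
% Notation assumed for illustration; see the lead-in above.
\begin{align*}
\pi^\star &\in \arg\max_{\pi}\;
  \mathbb{E}_{\pi}\Big[\sum_{h=1}^{H} r_h(x_h, y_h)
  - \eta\,\mathrm{KL}\big(\pi_h(\cdot \mid x_h)\,\|\,\pi^{\mathrm{ref}}_h(\cdot \mid x_h)\big)\Big], \\
\pi^\star_h(y \mid x) &\propto \pi^{\mathrm{ref}}_h(y \mid x)\,
  \exp\!\big(Q^\star_h(x, y)/\eta\big), \\
Q^\star_h(x, y) &= \eta \ln \mathbb{E}_{\pi^{\mathrm{ref}}}\Big[
  \exp\Big(\tfrac{1}{\eta}\sum_{t=h}^{H} r_t(x_t, y_t)\Big)
  \,\Big|\, x_h = x,\; y_h = y\Big]
  \quad \text{(deterministic transitions)}.
\end{align*}
```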
Problem

Research questions and friction points this paper is trying to address.

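Policy-based post-training methods such as PPO and DPO can fall short of fixing shortcuts inherited from pre-training.
Prior value-based baselines guide the model with unregularized Q-values and lack guarantees for the KL-regularized objective.
Math reasoning gains need to be achieved while keeping the KL divergence to the reference policy small.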
Innovation

Methods, ideas, or system contributions that make the work stand out.

Value-based algorithm for KL-regularized RL
Distributional RL on aggregated online dataset
Theoretical reduction to no-regret online learning
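
The first two items can be illustrated with a minimal sketch of value-guided sampling at a single decoding step. This is not the authors' implementation (see the linked repository for that). It assumes a hypothetical categorical model of the return distribution over fixed `atoms`, and the function names, the temperature `eta`, and the toy numbers are all made up for illustration. The sketch turns a predicted return distribution per candidate action into a regularized Q-value, `eta * log E[exp(Z / eta)]`, and then reweights the reference policy by `exp(Q / eta)`.

```python
import numpy as np


def soft_q_from_return_dist(atoms, probs, eta):
    """Regularized Q-value per action: eta * log E[exp(Z / eta)].

    atoms: shape (K,), possible return values (hypothetical discretization).
    probs: shape (A, K), predicted return distribution for each of A actions,
           interpreted as returns obtained by following the reference policy.
    eta:   KL-regularization weight (> 0).
    """
    scaled = np.asarray(atoms, dtype=float) / eta
    shift = scaled.max()  # subtract the max for a stable log-sum-exp
    expectation = np.asarray(probs, dtype=float) @ np.exp(scaled - shift)
    return eta * (shift + np.log(expectation))


def guided_policy(ref_logits, q_values, eta):
    """Reweight the reference policy: pi(a) proportional to pi_ref(a) * exp(Q(a) / eta)."""
    logits = np.asarray(ref_logits, dtype=float) + np.asarray(q_values) / eta
    logits = logits - logits.max()  # stable softmax
    weights = np.exp(logits)
    return weights / weights.sum()


# Toy example (made-up numbers): three candidate tokens, binary reward.
atoms = np.array([0.0, 1.0])              # return is 0 (wrong) or 1 (correct)
probs = np.array([[0.9, 0.1],             # token 0: 10% chance of success
                  [0.5, 0.5],             # token 1: 50% chance of success
                  [0.1, 0.9]])            # token 2: 90% chance of success
ref_probs = np.array([0.6, 0.3, 0.1])     # reference policy prefers token 0
eta = 0.1

q_values = soft_q_from_return_dist(atoms, probs, eta)
policy = guided_policy(np.log(ref_probs), q_values, eta)
print("Q values:", q_values)
print("guided policy:", policy)
```

In this toy setting the guided policy moves probability mass away from the token the reference policy prefers and toward tokens with a higher predicted chance of success. As `eta` grows, the reweighting vanishes and the guided policy approaches the reference policy.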