Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reinforcement learning (RL) for large language model (LLM) reasoning faces three key challenges: the absence of standardized best practices, a fragmented mechanistic understanding, and inconsistent experimental setups that undermine reproducibility. This paper introduces a unified, open-source framework that systematically disentangles the performance of mainstream RL techniques across varying data difficulty, model scales, and architectures via rigorous reproduction and isolated evaluation. Its core contribution is a minimalist combination of two techniques that achieves stable critic-free policy optimization using only the vanilla PPO loss, resolving long-standing convergence issues in this paradigm. Experiments demonstrate that this combination consistently outperforms strong baselines, including GRPO and DAPO, across diverse settings, yielding superior training stability and reasoning quality. The framework provides a reproducible, interpretable empirical guide for selecting and deploying RL techniques for LLM reasoning.

📝 Abstract
Reinforcement learning for LLM reasoning has rapidly emerged as a prominent research area, marked by a significant surge in related studies on both algorithmic innovations and practical applications. Despite this progress, several critical challenges remain, including the absence of standardized guidelines for employing RL techniques and a fragmented understanding of their underlying mechanisms. Additionally, inconsistent experimental settings, variations in training data, and differences in model initialization have led to conflicting conclusions, obscuring the key characteristics of these techniques and creating confusion among practitioners when selecting appropriate techniques. This paper systematically reviews widely adopted RL techniques through rigorous reproductions and isolated evaluations within a unified open-source framework. We analyze the internal mechanisms, applicable scenarios, and core principles of each technique through fine-grained experiments, including datasets of varying difficulty, model sizes, and architectures. Based on these insights, we present clear guidelines for selecting RL techniques tailored to specific setups, and provide a reliable roadmap for practitioners navigating the RL-for-LLM domain. Finally, we reveal that a minimalist combination of two techniques can unlock the learning capability of critic-free policies using vanilla PPO loss. The results demonstrate that our simple combination consistently improves performance, surpassing strategies like GRPO and DAPO.
Problem

Research questions and friction points this paper is trying to address.

Standardizing RL techniques for LLM reasoning
Understanding conflicting conclusions from varied experimental setups
Selecting optimal RL techniques for specific scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic review of RL techniques via unified framework
Guidelines for RL technique selection in LLM reasoning
Minimalist two-technique combination unlocks critic-free policy learning with vanilla PPO loss
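The abstract names the ingredients of this minimalist approach (critic-free policy optimization with the vanilla PPO loss) but this page does not spell out the exact two-technique recipe. As a rough illustration only, a critic-free setup typically replaces the learned value baseline with a group-normalized reward, as in GRPO, and feeds that advantage into the standard PPO clipped objective. The sketch below shows those two standard pieces; the function names and the 0.2 clip range are assumptions for illustration, not details taken from the paper:

```python
import math

def group_normalized_advantages(rewards):
    """Critic-free advantage estimate: normalize each sampled response's
    reward within the group of samples drawn for the same prompt,
    removing the need for a learned value function (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8  # small epsilon avoids division by zero
    return [(r - mean) / std for r in rewards]

def ppo_clip_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Vanilla PPO clipped surrogate loss for a single token (to be
    minimized). ratio = pi_new(a|s) / pi_old(a|s)."""
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    return -min(unclipped, clipped)
```

For example, with binary correctness rewards `[1, 0, 1, 0]` over four samples of one prompt, correct responses receive positive advantages and incorrect ones negative, and the PPO clip then bounds how far each update can push the policy.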
👥 Authors
Zihe Liu (Alibaba Group)
Jiashun Liu (Alibaba Group)
Yancheng He (Alibaba Group)
Weixun Wang (Alibaba Group)
Jiaheng Liu (Nanjing University)
Ling Pan (Hong Kong University of Science and Technology)
Xinyu Hu (Peking University)
Shaopan Xiong (Alibaba Group)
Ju Huang (Alibaba Group)
Jian Hu (OpenRLHF)
Shengyi Huang (Allen Institute for Artificial Intelligence)
Siran Yang (Alibaba Group)
Jiamang Wang (Alibaba Group)
Wenbo Su (Alibaba Group)
Bo Zheng (Alibaba Group)