🤖 AI Summary
This work challenges the prevailing Markov Decision Process (MDP) formulation used in reinforcement learning (RL) post-training of LLMs, specifically the definition of states as the context window and the uniform allocation of reward across a trajectory. It argues that these assumptions make the MDP degenerate, effectively reducing RL to outcome-driven supervised learning. Method: Through theoretical analysis and experiments on Qwen-2.5 base models, the authors compare GRPO against iterative supervised fine-tuning with positive and negative samples (i-SFT). Contribution/Results: i-SFT achieves performance comparable to GRPO on reasoning benchmarks (e.g., GSM8K, Countdown). The authors also argue that the same structural assumptions indirectly incentivize longer sequences of intermediate tokens, which feeds the narrative that RL produces longer thinking traces. These findings call into question popular LLM RL frameworks and their interpretations, and motivate a reexamination of how the underlying MDP is modeled.
📝 Abstract
Reinforcement learning-based post-training of large language models (LLMs) has recently gained attention, particularly following the release of DeepSeek R1, which applied GRPO for fine-tuning. Amid the growing hype around improved reasoning abilities attributed to RL post-training, we critically examine the formulation and assumptions underlying these methods. We start by highlighting the popular structural assumptions made in modeling LLM training as a Markov Decision Process (MDP), and show how they lead to a degenerate MDP that does not really need the RL/GRPO apparatus. The two critical structural assumptions are (1) making the MDP states just a concatenation of the actions, with states becoming the context window and actions becoming the tokens in LLMs, and (2) splitting the reward of a state-action trajectory uniformly across the trajectory. Through a comprehensive analysis, we demonstrate that these simplifying assumptions make the approach effectively equivalent to outcome-driven supervised learning. Our experiments on benchmarks including GSM8K and Countdown using Qwen-2.5 base models show that iterative supervised fine-tuning, incorporating both positive and negative samples, achieves performance comparable to GRPO-based training. We also argue that the structural assumptions indirectly incentivize the RL to generate longer sequences of intermediate tokens, which in turn feeds into the narrative of "RL generating longer thinking traces." While RL may well be a very useful technique for improving the reasoning abilities of LLMs, our analysis shows that the simplistic structural assumptions made in modeling the underlying MDP render the popular LLM RL frameworks and their interpretations questionable.
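To make the claimed equivalence concrete, the following is a minimal sketch, not the authors' implementation. It assumes binary outcome rewards, a GRPO-style group-normalized advantage, and that every token in a response is credited with the same sequence-level advantage; it omits the importance ratio, clipping, KL penalty, and any length normalization. The function names and the toy log-probabilities are hypothetical. Under these assumptions, the token-wise objective equals a sequence log-likelihood weighted by a signed per-sample weight, i.e. supervised learning on positive and negative samples.

```python
import math

def group_advantages(rewards):
    """Group-normalized advantages (r - mean) / std over one prompt's samples."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    if std == 0.0:
        # All samples got the same reward: no learning signal for this group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

def tokenwise_objective(token_logps, advantages):
    """RL-style view: each token's log-prob is weighted by its response's advantage."""
    total = 0.0
    for logps, adv in zip(token_logps, advantages):
        for lp in logps:
            total += adv * lp
    return total

def weighted_sft_objective(token_logps, advantages):
    """Supervised view: each response's sequence log-likelihood gets a signed weight."""
    return sum(adv * sum(logps) for logps, adv in zip(token_logps, advantages))

# Hypothetical group of four sampled responses to one prompt (1 = correct, 0 = incorrect).
rewards = [1, 0, 0, 1]
advantages = group_advantages(rewards)

# Hypothetical per-token log-probabilities under the current policy.
token_logps = [
    [-0.2, -0.5, -0.1],
    [-1.0, -0.3],
    [-0.7, -0.4, -0.9, -0.2],
    [-0.3, -0.6],
]

rl_view = tokenwise_objective(token_logps, advantages)
sft_view = weighted_sft_objective(token_logps, advantages)
print(advantages, rl_view, sft_view)
```

In this toy setup, correct responses receive a positive weight and incorrect ones a negative weight, so the two views differ only in how the same sum is grouped.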