The Two-Stage Decision-Sampling Hypothesis: Understanding the Emergence of Self-Reflection in RL-Trained LLMs

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study uncovers the intrinsic mechanism by which large language models acquire self-reflection capabilities through reinforcement learning (RL) training. To explain how a unified optimization objective gives rise to functionally distinct generation and correction modules, the work proposes a two-stage decision-sampling hypothesis that decouples the policy into a sampling policy (for generation) and a decision policy (for verification). Through gradient attribution analysis, theoretical proofs, and empirical evaluation on arithmetic reasoning tasks, the study shows that RL enhances self-correction by balancing the optimization of both policies, whereas supervised fine-tuning and KL regularization underperform due to gradient bias toward the sampling policy. The findings establish that RL's generalization advantage stems primarily from strengthening the decision policy, offering a first-principles mechanistic explanation for self-correction in reasoning models.
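To make the decomposition concrete, one minimal reading of the hypothesis (the factorization below is illustrative notation, not taken verbatim from the paper) treats each generation step as a decision about whether to keep generating or to revise, followed by token sampling conditioned on that choice:

$$\pi_\theta(y_t \mid x, y_{<t}) = \sum_{a \in \{\text{continue},\ \text{revise}\}} \pi_{d}(a \mid x, y_{<t})\, \pi_{sample}(y_t \mid x, y_{<t}, a)$$

Under this reading, self-reflection corresponds to $\pi_{d}$ placing sufficient probability on the revise branch at verification points, while $\pi_{sample}$ determines the quality of the tokens produced along either branch.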

📝 Abstract
Self-reflection capabilities emerge in Large Language Models after RL post-training, with multi-turn RL achieving substantial gains over SFT counterparts. Yet the mechanism by which a unified optimization objective gives rise to the functionally distinct capabilities of generating solutions and evaluating when to revise them remains opaque. To address this question, we introduce the Gradient Attribution Property to characterize how reward gradients distribute across policy components, formalized through the Two-Stage Decision-Sampling (DS) Hypothesis, which decomposes the policy into a sampling policy ($\pi_{sample}$) for generation and a decision policy ($\pi_{d}$) for verification. We prove that surrogate rewards exhibit Balanced Gradient Attribution, while SFT and KL penalties exhibit Unbalanced Gradient Attribution, with length-weighting creating asymmetric regularization that constrains $\pi_{sample}$ while leaving $\pi_{d}$ under-optimized, providing a theoretical explanation of why RL succeeds where SFT fails. Empirical validation of our theoretical predictions on arithmetic reasoning demonstrates that RL's superior generalization stems primarily from improved decision-making ($\pi_{d}$) rather than sampling capabilities, providing a first-principles mechanistic explanation for self-correction in thinking models.
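A hedged sketch of the gradient-attribution argument (an illustrative reading consistent with the abstract, not the paper's exact derivation): SFT and KL losses are sums of per-token terms, so their gradient is dominated by the many generation tokens attributed to $\pi_{sample}$, while the few decision tokens attributed to $\pi_{d}$ receive proportionally little signal. A sequence-level surrogate reward instead weights every token's log-probability by the same trajectory-level reward:

$$\nabla_\theta \mathcal{L}_{\mathrm{SFT/KL}} = \sum_{t \in \mathcal{T}_{sample}} \nabla_\theta \ell_t + \sum_{t \in \mathcal{T}_{d}} \nabla_\theta \ell_t, \quad |\mathcal{T}_{sample}| \gg |\mathcal{T}_{d}|, \qquad \nabla_\theta J_{\mathrm{RL}} = \mathbb{E}\!\left[ R(y)\, \nabla_\theta \log \pi_\theta(y \mid x) \right]$$

Here $\ell_t$ is the per-token loss and $\mathcal{T}_{sample}$, $\mathcal{T}_{d}$ index tokens attributed to each sub-policy. In this view, Balanced Gradient Attribution means $R(y)$ propagates credit to $\pi_{d}$ whenever a revision changes the outcome, whereas the length-weighted token sums leave $\pi_{d}$ under-optimized.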
Problem

Research questions and friction points this paper is trying to address.

self-reflection
reinforcement learning
large language models
mechanistic interpretability
decision-making
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-Stage Decision-Sampling
Gradient Attribution Property
Self-Reflection
Reinforcement Learning
Large Language Models
🔎 Similar Papers
No similar papers found.
Zibo Zhao
Hunyuan, Tencent; ShanghaiTech
Yuanting Zha
ShanghaiTech University
Haipeng Zhang
ShanghaiTech University
Data Mining · Geo-spatial Data Mining · Web Mining · Social Computing · Fintech
Xingcheng Xu
Shanghai Artificial Intelligence Laboratory