Approximation of Log-Partition Function in Policy Mirror Descent Induces Implicit Regularization for LLM Post-Training

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a central challenge in policy mirror descent (PMD) for reinforcement learning with large language models: accurate estimation of the partition function is infeasible in high-dimensional action spaces with limited sampling. To overcome this, the authors propose PMD-mean, which approximates the log-partition function with the average reward under the current policy and performs regression updates in log-policy space. Theoretical analysis reveals that this approximation implicitly introduces a hybrid KL–χ² regularizer that adaptively constrains large policy updates and yields more conservative adjustments in low-return regions, thereby significantly enhancing robustness to finite-sample errors. Experiments demonstrate that PMD-mean outperforms existing baselines on mathematical reasoning tasks, achieving higher accuracy, greater stability, and faster convergence.

📝 Abstract
Policy mirror descent (PMD) provides a principled framework for reinforcement learning (RL) by iteratively solving KL-regularized policy improvement subproblems. While this approach has been adopted in training advanced LLMs such as Kimi K1.5/K2, the ideal closed-form PMD updates require reliable partition function estimation, a significant challenge when working with limited rollouts in the vast action spaces of LLMs. We investigate a practical algorithm, termed PMD-mean, that approximates the log-partition term with the mean reward under the sampling policy and performs regression in log-policy space. Specifically, we characterize the population solution of PMD-mean and demonstrate that it implicitly optimizes mirror descent subproblems with an adaptive mixed KL--$\chi^2$ regularizer. This additional $\chi^2$ regularization constrains large probability changes, producing more conservative updates when expected rewards are low and enhancing robustness against finite-sample estimation errors. Experiments on math reasoning tasks show that PMD-mean achieves superior performance with improved stability and time efficiency. These findings deepen our understanding of PMD-mean and illuminate pathways toward principled improvements in RL algorithms for LLMs. Code is available at https://github.com/horizon-rl/OpenKimi.
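To make the approximation concrete, here is a minimal sketch of the update described in the abstract. In ideal PMD, the KL-regularized subproblem has the closed-form solution $\log \pi_{k+1}(y|x) = \log \pi_k(y|x) + r(x,y)/\beta - \log Z(x)$, where the log-partition term $\log Z(x)$ is hard to estimate from a few rollouts; PMD-mean replaces it with the mean reward under the sampling policy and regresses the new policy toward the resulting targets in log-policy space. The function names, the $1/\beta$ scaling of the mean-reward term, and the squared-error regression form are illustrative assumptions based on the abstract, not the authors' implementation.

```python
import numpy as np

def pmd_mean_targets(logp_old, rewards, beta):
    """Regression targets in log-policy space for one prompt's rollouts.

    logp_old: log-probabilities of sampled responses under the current policy
    rewards:  scalar rewards for those responses
    beta:     strength of the KL regularizer in the PMD subproblem
    """
    rewards = np.asarray(rewards, dtype=float)
    # PMD-mean: approximate the intractable log-partition term with the
    # mean reward under the sampling policy (illustrative scaling by 1/beta).
    log_z_approx = rewards.mean() / beta
    return np.asarray(logp_old, dtype=float) + rewards / beta - log_z_approx

def pmd_mean_loss(logp_new, logp_old, rewards, beta):
    """Squared regression loss between new log-probs and PMD-mean targets."""
    targets = pmd_mean_targets(logp_old, rewards, beta)
    return float(np.mean((np.asarray(logp_new, dtype=float) - targets) ** 2))

# Example: four rollouts for one prompt. Responses with above-average reward
# get targets above their old log-probs; below-average ones are pushed down.
logp_old = np.array([-3.0, -2.5, -4.0, -3.5])
rewards = np.array([1.0, 0.0, 1.0, 0.0])
targets = pmd_mean_targets(logp_old, rewards, beta=2.0)
```

Because the targets are mean-centered in reward, updates shrink toward no change when rewards are uniform, which is one way to see the conservative behavior the paper attributes to the implicit χ² term.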
Problem

Research questions and friction points this paper is trying to address.

log-partition function
policy mirror descent
LLM post-training
reinforcement learning
action space
Innovation

Methods, ideas, or system contributions that make the work stand out.

Policy Mirror Descent
Implicit Regularization
Log-Partition Approximation
KL-χ² Regularization
LLM Post-Training
👥 Authors
Zhenghao Xu, Georgia Institute of Technology
Qin Lu, Amazon
Changlong Yu, Amazon
Tuo Zhao, Associate Professor, Georgia Tech (Machine Learning, Large Language Models, Artificial Intelligence, Optimization, Statistics)