🤖 AI Summary
This work addresses the lack of rigorous non-asymptotic convergence guarantees for the PPO-Clip algorithm. We study a deterministic actor-only variant under softmax policy parameterization and *f*-divergence regularization. First, we establish non-uniform Lipschitz smoothness and the Łojasiewicz inequality for the policy objective function. Building on this, we prove global linear convergence under forward KL regularization and local linear convergence under reverse KL regularization. Our analysis overcomes key limitations of existing PPO theory—which often relies on asymptotic assumptions or critic-coupled conditions—by providing the first verifiable non-asymptotic convergence guarantee for an actor-only PPO variant. This significantly advances the optimization-theoretic foundation of PPO and rigorously confirms its ability to efficiently converge to the globally optimal policy.
📝 Abstract
Reinforcement learning (RL) has gained attention for aligning large language models (LLMs) via reinforcement learning from human feedback (RLHF). Actor-only variants of Proximal Policy Optimization (PPO) are widely used for their efficiency. These algorithms incorporate a clipping mechanism to improve stability; in addition, a regularization term, such as the reverse KL-divergence or a more general *f*-divergence, is introduced to prevent policy drift. Despite their empirical success, rigorous theoretical understanding of both the underlying problem and the algorithm's properties remains limited. This paper advances the theoretical foundations of the PPO-Clip algorithm by analyzing a deterministic actor-only PPO algorithm in the general RL setting with *f*-divergence regularization under the softmax policy parameterization. We derive a non-uniform Lipschitz smoothness condition and a Łojasiewicz inequality for the considered problem. Based on these, we establish a non-asymptotic linear convergence rate to the globally optimal policy for the forward KL regularizer. Furthermore, we derive convergence to stationary points and local linear convergence for the reverse KL regularizer.
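For concreteness, the regularized objective under softmax parameterization can be sketched as follows. This is one plausible rendering in standard notation, not the paper's exact formulation; the reference policy $\pi_{\mathrm{ref}}$, regularization weight $\lambda$, and state weighting $d(s)$ are assumptions here.

```latex
% Softmax policy parameterization over logits \theta_{s,a}:
\[
\pi_\theta(a \mid s) \;=\; \frac{\exp(\theta_{s,a})}{\sum_{a'} \exp(\theta_{s,a'})}.
\]
% f-divergence-regularized objective with reference policy \pi_{\mathrm{ref}},
% regularization weight \lambda > 0, and state weights d(s):
\[
\max_{\theta} \; V^{\pi_\theta}(\rho)
\;-\; \lambda \sum_{s} d(s)\, D_f\!\big(\pi_\theta(\cdot \mid s) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid s)\big),
\qquad
D_f(p \,\|\, q) \;=\; \sum_{a} q(a)\, f\!\Big(\tfrac{p(a)}{q(a)}\Big).
\]
% Choosing f(x) = x \log x recovers the reverse KL term
% D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}}), while
% f(x) = -\log x recovers the forward KL term
% D_{\mathrm{KL}}(\pi_{\mathrm{ref}} \,\|\, \pi_\theta).
\]
```

The two generator choices shown at the end are what distinguish the abstract's two regimes: the forward KL case, for which global linear convergence is claimed, and the reverse KL case, for which stationary convergence and local linear convergence are derived.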