🤖 AI Summary
Existing proximal policy optimization methods such as GRPO and CISPO rely on multiple gradient steps with importance-ratio clipping to approximate a natural policy gradient step relative to a reference policy, which adds computational cost and limits training stability. This paper proposes Isometric Policy Optimization (ISOPO), a method that approximates the natural policy gradient under the Fisher metric in a single gradient step. In its simplest form, ISOPO normalizes each sequence's log-probability gradient in the Fisher metric before contracting it with the advantages; a second variant instead transforms the microbatch advantages using layer-wise neural tangent kernels (NTKs). Neither requires the reference policy π_old or importance sampling. Because the transformation is applied layer-wise during a single backward pass, ISOPO keeps the structure of REINFORCE and adds almost no computational overhead while improving update stability and sample efficiency. Empirically, ISOPO matches or surpasses multi-step approximation methods on standard benchmarks.
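To illustrate the simplest form described above, here is a minimal PyTorch-style sketch. The `sequence_log_prob` helper is hypothetical, and the Euclidean norm of each per-sequence gradient is used as a stand-in for its Fisher-metric norm; the exact metric and normalization used by ISOPO may differ.

```python
import torch

def isopo_simple_step(policy, sequences, advantages, lr=1e-5):
    """Sketch of a single ISOPO-style update: normalize each sequence's
    log-probability gradient, weight it by its advantage, and take one
    REINFORCE-style ascent step. Euclidean norm stands in for the Fisher norm."""
    params = [p for p in policy.parameters() if p.requires_grad]
    update = [torch.zeros_like(p) for p in params]

    for seq, adv in zip(sequences, advantages):
        logp = policy.sequence_log_prob(seq)       # assumed helper: sum of token log-probs
        grads = torch.autograd.grad(logp, params)  # per-sequence gradient g_i
        norm = torch.sqrt(sum((g * g).sum() for g in grads)) + 1e-8
        for acc, g in zip(update, grads):
            acc.add_(g, alpha=float(adv) / float(norm))  # advantage-weighted, normalized gradient

    with torch.no_grad():
        for p, g in zip(params, update):
            p.add_(g, alpha=lr / len(sequences))   # single ascent step on the averaged update
```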
📝 Abstract
This note introduces Isometric Policy Optimization (ISOPO), an efficient method to approximate the natural policy gradient in a single gradient step. By contrast, existing proximal policy methods such as GRPO or CISPO use multiple gradient steps with variants of importance-ratio clipping to approximate a natural gradient step relative to a reference policy. In its simplest form, ISOPO normalizes the log-probability gradient of each sequence in the Fisher metric before contracting it with the advantages. Another variant transforms the microbatch advantages using the neural tangent kernel of each layer; this transformation is applied layer-wise in a single backward pass and can be implemented with negligible computational overhead compared to vanilla REINFORCE.
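One way to read the advantage-transformation variant is through the push-through identity (GᵀG + λI)⁻¹GᵀA = Gᵀ(GGᵀ + λI)⁻¹A: replacing the advantages A with (K + λI)⁻¹A, where K = GGᵀ is the layer's NTK/Gram matrix of per-sequence gradients, turns the ordinary REINFORCE contraction GᵀA into a damped natural-gradient step for that layer. The sketch below follows this reading; the function name, the damping term lam, and the use of a plain inverse (rather than, say, an inverse square root) are assumptions for illustration, not details taken from the paper.

```python
import torch

def ntk_transform_advantages(G: torch.Tensor, advantages: torch.Tensor,
                             lam: float = 1e-3) -> torch.Tensor:
    """Hypothetical helper: G is the [n, p] matrix of per-sequence gradients for
    one layer, advantages is the [n] microbatch advantage vector. Returns the
    advantages transformed by the damped inverse layer-wise NTK, (K + lam*I)^{-1} A."""
    K = G @ G.T                                            # layer-wise NTK / Gram matrix, [n, n]
    eye = torch.eye(K.shape[0], device=K.device, dtype=K.dtype)
    return torch.linalg.solve(K + lam * eye, advantages)

# The layer's update is then G.T @ ntk_transform_advantages(G, A): the usual
# REINFORCE-style contraction with NTK-modulated advantages, which coincides with
# the damped natural-gradient step (G.T @ G + lam*I)^{-1} @ (G.T @ A) for that layer.
```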