ISOPO: Proximal policy gradients without pi-old

📅 2025-12-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing proximal policy optimization methods (e.g., GRPO, CISPO) rely on multi-step gradient updates and importance-weight clipping to approximate the natural policy gradient under a reference policy, which is computationally inefficient and limits training stability. This paper proposes Isometric Policy Optimization (ISOPO), a method that approximates the natural policy gradient under the Fisher metric in a single update step. ISOPO modulates the advantages via layer-wise neural tangent kernels (NTKs), normalizing log-probability gradients without requiring the reference policy π_old or importance sampling. By applying the transformation layer-wise during a single backward pass of a REINFORCE-style update, ISOPO improves update stability and sample efficiency at nearly zero additional computational overhead. Empirically, ISOPO matches or surpasses multi-step approximation methods across standard benchmarks.

📝 Abstract
This note introduces Isometric Policy Optimization (ISOPO), an efficient method to approximate the natural policy gradient in a single gradient step. In comparison, existing proximal policy methods such as GRPO or CISPO use multiple gradient steps with variants of importance ratio clipping to approximate a natural gradient step relative to a reference policy. In its simplest form, ISOPO normalizes the log-probability gradient of each sequence in the Fisher metric before contracting with the advantages. Another variant of ISOPO transforms the microbatch advantages based on the neural tangent kernel in each layer. ISOPO applies this transformation layer-wise in a single backward pass and can be implemented with negligible computational overhead compared to vanilla REINFORCE.
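The simplest variant described above (normalizing each sequence's log-probability gradient before contracting with the advantages) can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function names are invented, and the per-sequence Euclidean gradient norm is used as a stand-in for the Fisher-metric norm.

```python
import numpy as np

def isopo_simple_update(G, A, eps=1e-8):
    """Hypothetical sketch of the simplest ISOPO variant: normalize each
    sequence's log-probability gradient (here by its Euclidean norm, as a
    stand-in for the Fisher-metric norm) before contracting with advantages.

    G : (n, p) array, row i = grad of log pi(sequence_i) w.r.t. parameters
    A : (n,) array of advantages
    Returns a (p,) parameter update direction.
    """
    norms = np.sqrt(np.sum(G * G, axis=1)) + eps  # per-sequence gradient norms
    G_hat = G / norms[:, None]                    # unit-norm gradients
    return G_hat.T @ A                            # contract with advantages

def reinforce_update(G, A):
    """Vanilla REINFORCE baseline: no gradient normalization."""
    return G.T @ A
```

Compared with the REINFORCE baseline, the normalization prevents a single sequence with a large gradient from dominating the update.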
Problem

Research questions and friction points this paper is trying to address.

Introduces Isometric Policy Optimization for natural gradient approximation
Compares ISOPO to existing proximal policy methods using multiple steps
Applies layer-wise transformations with minimal computational overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-step natural policy gradient approximation
Fisher metric normalization of log-probability gradients
Layer-wise neural tangent kernel advantage transformation
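One plausible reading of the NTK advantage transformation: with per-sequence gradients stacked as rows of G, the empirical Fisher is F ≈ G.T @ G / n, and the natural-gradient update F⁺ (G.T @ A) equals (up to scale) G.T @ K⁻¹ A, where K = G @ G.T is the NTK-style Gram matrix. So transforming the microbatch advantages through K⁻¹ and then doing an ordinary REINFORCE contraction recovers a natural-gradient step. The sketch below illustrates this reading and is not the paper's code; the ridge term is an added assumption for numerical stability.

```python
import numpy as np

def isopo_ntk_update(G, A, ridge=1e-6):
    """Illustrative sketch of the NTK variant: transform the microbatch
    advantages through the inverse Gram (NTK) matrix K = G @ G.T, then
    contract with the raw per-sequence gradients. This matches the
    natural-gradient direction pinv(G) @ A when ridge = 0.

    G : (n, p) per-sequence log-probability gradients
    A : (n,) advantages
    """
    n = G.shape[0]
    K = G @ G.T + ridge * np.eye(n)   # NTK-style Gram matrix, ridged for stability
    A_prime = np.linalg.solve(K, A)   # transformed advantages
    return G.T @ A_prime              # single REINFORCE-style contraction
```

Because K is n×n (microbatch size), inverting it is cheap relative to a backward pass, which is consistent with the claimed negligible overhead; the paper applies the transformation per layer rather than globally as shown here.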