🤖 AI Summary
This work proposes PROMA, a novel proximal policy optimization method that achieves stable updates without relying on reference policies or likelihood ratio clipping, two common sources of entropy collapse and poor control over local KL divergence in existing approaches. By projecting out sequence-level gradient components layer-wise during backpropagation and incorporating micro-batch gradient accumulation, PROMA mitigates entropy collapse while enforcing a tighter constraint on the local KL divergence. This design improves training stability and the robustness of policy learning. Empirical results show that PROMA outperforms current methods such as GRPO for stable and efficient policy optimization in reinforcement learning.
📝 Abstract
This note introduces Projected Microbatch Accumulation (PROMA), a proximal policy method that modifies gradient accumulation across microbatches rather than relying on likelihood ratios relative to a reference policy. During accumulation, PROMA projects the partially accumulated gradient to be orthogonal to the sequence-wise gradients of the current microbatch. This projection is applied layer-wise during the backward pass, enabling efficient implementation. A within-microbatch variant, Intra-PROMA, applies the projection within each microbatch and therefore acts independently across microbatches. Empirically, PROMA achieves proximal updates without entropy collapse while providing tighter local KL control than GRPO.
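The core operation described above is an orthogonal projection: before a microbatch's gradient is folded in, the accumulated gradient is made orthogonal to the per-sequence gradients of that microbatch, one layer at a time. The following is a minimal NumPy sketch of that projection step under stated assumptions: the function name `project_out`, the use of flattened per-layer gradient vectors, and the Gram-Schmidt orthonormalization of the sequence gradients are all illustrative choices, not the authors' implementation.

```python
import numpy as np


def project_out(accum_grad, seq_grads, eps=1e-12):
    """Return accum_grad with its components along seq_grads removed.

    accum_grad: flattened gradient for one layer, accumulated over
        previous microbatches (hypothetical representation).
    seq_grads: list of flattened per-sequence gradients for the
        current microbatch, for the same layer.
    """
    g = accum_grad.astype(float).copy()
    # Build an orthonormal basis for the span of the sequence-wise
    # gradients via Gram-Schmidt, skipping near-degenerate directions.
    basis = []
    for s in seq_grads:
        v = s.astype(float).copy()
        for b in basis:
            v -= (v @ b) * b
        norm = np.linalg.norm(v)
        if norm > eps:
            basis.append(v / norm)
    # Remove each basis component from the accumulated gradient, so the
    # result is orthogonal to every sequence gradient in the microbatch.
    for b in basis:
        g -= (g @ b) * b
    return g
```

In an actual training loop this would run inside the backward pass (e.g. via per-layer backward hooks) so that each layer's accumulator is projected as its gradients become available; how the projected accumulator is then combined with the current microbatch's gradient is not specified here and would follow the paper's update rule.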