PROMA: Projected Microbatch Accumulation for Reference-Free Proximal Policy Updates

πŸ“… 2026-01-15
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work proposes PROMA, a proximal policy optimization method that achieves stable updates without relying on a reference policy or likelihood-ratio clipping, two common sources of entropy collapse and weak control over local KL divergence in existing approaches. By projecting out sequence-level gradient components layer-wise during backpropagation and incorporating microbatch gradient accumulation, PROMA mitigates entropy collapse while enforcing a tighter constraint on the local KL divergence. This design improves training stability and the robustness of policy learning. Empirical results show that PROMA achieves proximal updates with tighter local KL control than existing methods such as GRPO.

πŸ“ Abstract
This note introduces Projected Microbatch Accumulation (PROMA), a proximal policy method that modifies gradient accumulation across microbatches rather than relying on likelihood ratios relative to a reference policy. During accumulation, PROMA projects the partially accumulated gradient to be orthogonal to the sequence-wise gradients of the current microbatch. This projection is applied layer-wise during the backward pass, enabling efficient implementation. A within-microbatch variant (Intra-PROMA) acts independently across microbatches. Empirically, PROMA achieves proximal updates without entropy collapse while providing tighter local KL control than GRPO.
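The core mechanism described in the abstract can be sketched per layer: before adding a new microbatch's contribution, the partially accumulated gradient is projected to be orthogonal to the span of that microbatch's per-sequence gradients. The sketch below is a simplified NumPy illustration of this idea, not the paper's implementation; the function names and the use of Gram-Schmidt orthonormalization are assumptions, and a real implementation would run layer-wise inside the backward pass as the note describes.

```python
import numpy as np

def project_out(v, basis):
    # Remove from v its component along each orthonormal basis vector.
    for b in basis:
        v = v - np.dot(v, b) * b
    return v

def proma_accumulate(acc_grad, seq_grads):
    """One PROMA-style accumulation step for a single layer (sketch).

    acc_grad:  flattened gradient accumulated from earlier microbatches.
    seq_grads: list of flattened per-sequence gradients of the current
               microbatch.
    """
    # Orthonormalize the sequence-wise gradients (Gram-Schmidt) so the
    # projection removes the full span they define, not just each
    # direction in isolation.
    basis = []
    for g in seq_grads:
        g = project_out(np.array(g, dtype=float), basis)
        n = np.linalg.norm(g)
        if n > 1e-12:
            basis.append(g / n)
    # Project the partially accumulated gradient orthogonal to that span,
    # then add the current microbatch's mean gradient.
    acc_grad = project_out(np.array(acc_grad, dtype=float), basis)
    return acc_grad + np.mean(seq_grads, axis=0)
```

Under this sketch, the previously accumulated gradient contributes nothing along the current microbatch's sequence directions, which is one plausible way to limit how strongly any single sequence's direction is reinforced across accumulation steps.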
Problem

Research questions and friction points this paper is trying to address.

proximal policy update
reinforcement learning
large language model fine-tuning
KL divergence
entropy collapse
Innovation

Methods, ideas, or system contributions that make the work stand out.

Projected Microbatch Accumulation
proximal policy update
gradient projection
reference-free RL
entropy collapse