🤖 AI Summary
This work proposes PROMA, a novel proximal policy optimization method that achieves stable updates without relying on reference policies or likelihood ratio clipping, two common sources of entropy collapse and poor control over local KL divergence in existing approaches. By projecting out sequence-level gradient components layer-wise during backpropagation and incorporating micro-batch gradient accumulation, PROMA mitigates entropy collapse while enforcing a tighter constraint on the local KL divergence. This design improves training stability and the robustness of policy learning. Empirical results show that PROMA outperforms current methods such as GRPO for stable and efficient policy optimization in reinforcement learning.
📝 Abstract
This note introduces Projected Microbatch Accumulation (PROMA), a proximal policy method that modifies gradient accumulation across microbatches rather than relying on likelihood ratios relative to a reference policy. During accumulation, PROMA projects the partially accumulated gradient to be orthogonal to the sequence-wise gradients of the current microbatch. This projection is applied layer-wise during the backward pass, enabling efficient implementation. A within-microbatch variant, Intra-PROMA, applies the projection within each microbatch and therefore acts independently across microbatches. Empirically, PROMA achieves proximal updates without entropy collapse while providing tighter local KL control than GRPO.
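The core operation described above is an orthogonal projection: before a microbatch's gradient is folded in, the accumulated gradient is made orthogonal to the per-sequence gradients of that microbatch, one layer at a time. The following is a minimal NumPy sketch of that projection step under stated assumptions: the function name `project_out`, the use of flattened per-layer gradient vectors, and the Gram-Schmidt orthonormalization of the sequence gradients are all illustrative choices, not the authors' implementation.

```python
import numpy as np


def project_out(accum_grad, seq_grads, eps=1e-12):
    """Return accum_grad with its components along seq_grads removed.

    accum_grad: flattened gradient for one layer, accumulated over
        previous microbatches (hypothetical representation).
    seq_grads: list of flattened per-sequence gradients for the
        current microbatch, for the same layer.
    """
    g = accum_grad.astype(float).copy()
    # Build an orthonormal basis for the span of the sequence-wise
    # gradients via Gram-Schmidt, skipping near-degenerate directions.
    basis = []
    for s in seq_grads:
        v = s.astype(float).copy()
        for b in basis:
            v -= (v @ b) * b
        norm = np.linalg.norm(v)
        if norm > eps:
            basis.append(v / norm)
    # Remove each basis component from the accumulated gradient, so the
    # result is orthogonal to every sequence gradient in the microbatch.
    for b in basis:
        g -= (g @ b) * b
    return g
```

In an actual training loop this would run inside the backward pass (e.g. via per-layer backward hooks) so that each layer's accumulator is projected as its gradients become available; how the projected accumulator is then combined with the current microbatch's gradient is not specified here and would follow the paper's update rule.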