It Takes Two: Your GRPO Is Secretly DPO

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Conventional GRPO training is believed to require large group sizes (e.g., 16 rollouts per prompt, as in 16-GRPO) to ensure stability, incurring prohibitive rollout overhead. Method: This work challenges that paradigm by reformulating GRPO as a contrastive learning framework; it theoretically establishes that convergence and stability are guaranteed with groups of only two rollouts, yielding the proposed 2-GRPO algorithm. 2-GRPO tightly couples GRPO's policy optimization with DPO's preference modeling, preserving discriminative capability while drastically reducing computational cost. Contribution/Results: Empirical evaluation shows that 2-GRPO matches 16-GRPO's performance using only 1/8 the number of rollouts, accelerating training by over 70%. Beyond debunking the assumption that larger groups are needed for stability, this work establishes an efficient, lightweight, and theoretically grounded pathway for LLM alignment.
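A quick way to see the GRPO–DPO connection the summary describes: GRPO normalizes each rollout's reward by the group mean and standard deviation, and with a group of exactly two rollouts this normalized advantage degenerates to a ±1 sign, i.e. a binary chosen/rejected label just like a DPO preference pair. A minimal sketch (the `grpo_advantages` helper is illustrative, not code from the paper):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """Group-normalized advantages as in GRPO: (r - mean) / std.

    Uses the population standard deviation over the rollout group.
    """
    mu = mean(rewards)
    sd = pstdev(rewards)
    if sd == 0:
        # All rollouts scored identically: no learning signal.
        return [0.0] * len(rewards)
    return [(r - mu) / sd for r in rewards]

# With a group of two rollouts, advantages collapse to exactly -1 and +1,
# a binary chosen/rejected label, as in a DPO preference pair:
print(grpo_advantages([0.3, 0.9]))

# With a larger group (e.g., 16-GRPO), advantages are graded instead:
print(grpo_advantages([0.1, 0.4, 0.7, 1.0]))
```

The collapse holds for any pair of unequal rewards, since (r1 - mean) / std = sign(r1 - r2) when the group has only two members.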

📝 Abstract
Group Relative Policy Optimization (GRPO) is a prominent reinforcement learning algorithm for post-training Large Language Models (LLMs). It is commonly believed that GRPO necessitates a large group size to ensure stable training via precise statistical estimation, which incurs substantial computational overhead. In this work, we challenge this assumption by reframing GRPO as a form of contrastive learning, which reveals a fundamental connection to Direct Preference Optimization (DPO). Motivated by DPO's empirical success, we investigate the minimal two-rollout case (2-GRPO), a configuration previously deemed infeasible. We provide a rigorous theoretical analysis to validate 2-GRPO and demonstrate empirically that it achieves performance on par with 16-GRPO, despite using only 1/8 of the rollouts and reducing training time by over 70%.
Problem

Research questions and friction points this paper is trying to address.

Challenges GRPO's need for large group sizes
Reveals GRPO's connection to DPO via contrastive learning
Validates minimal two-rollout configuration for efficient training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reframed GRPO as contrastive learning method
Introduced minimal two-rollout configuration called 2-GRPO
Achieved comparable performance with significantly reduced computation