F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the tendency of group-based sampling in reinforcement learning to overlook rare yet correct trajectories, which biases policies toward frequently occurring solutions. Inspired by Focal Loss, the authors propose a difficulty-aware advantage-scaling mechanism that dynamically down-weights updates from high-success-probability samples within group-relative policy optimization frameworks such as GRPO. The approach strengthens learning from rare correct trajectories without increasing group size or computational overhead, and is presented as the first explicit solution to rare-correct-trajectory forgetting in RLVR. Experiments on Qwen2.5-7B show significant improvements in pass@256 (e.g., from 64.1 to 70.3 with GRPO) while maintaining or even improving pass@1 performance.

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is commonly based on group sampling to estimate advantages and stabilize policy updates. In practice, large group sizes are not feasible due to computational limits, which biases learning toward trajectories that are already likely. Smaller groups often miss rare-correct trajectories while still containing mixed rewards, concentrating probability on common solutions. We derive the probability that updates miss rare-correct modes as a function of group size, showing non-monotonic behavior, and characterize how updates redistribute mass within the correct set, revealing that unsampled-correct mass can shrink even as total correct mass grows. Motivated by this analysis, we propose a difficulty-aware advantage scaling coefficient, inspired by Focal loss, that down-weights updates on high-success prompts. The lightweight modification can be directly integrated into any group-relative RLVR algorithm such as GRPO, DAPO, and CISPO. On Qwen2.5-7B across in-domain and out-of-domain benchmarks, our method improves pass@256 from 64.1 $\rightarrow$ 70.3 (GRPO), 69.3 $\rightarrow$ 72.5 (DAPO), and 73.2 $\rightarrow$ 76.8 (CISPO), while preserving or improving pass@1, without increasing group size or computational cost.
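The abstract's core mechanism, group-relative advantages multiplied by a Focal-Loss-style difficulty coefficient, can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name `focal_scaled_advantages`, the exponent `gamma`, and the specific coefficient form `(1 - p)^gamma` (with `p` the group's empirical success rate) are assumptions modeled on Focal Loss, since the paper's precise equation is not given here.

```python
import math

def focal_scaled_advantages(rewards, gamma=2.0, eps=1e-8):
    """Group-relative advantages with a hypothetical focal-style weight.

    rewards: binary verifiable rewards (0/1) for one prompt's group of
    rollouts. gamma and the (1 - p)**gamma coefficient are illustrative
    assumptions, not the paper's exact formulation.
    """
    n = len(rewards)
    p = sum(rewards) / n  # empirical success rate of this prompt's group
    mean = p
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    # Focal-style coefficient: high-success (easy) prompts get a small
    # weight, so updates from rare-correct trajectories are not drowned out.
    weight = (1.0 - p) ** gamma
    # GRPO-style group-normalized advantage, scaled by the difficulty weight.
    return [weight * (r - mean) / (std + eps) for r in rewards]
```

For example, a correct rollout in a mostly failing group (`[1, 0, 0, 0]`, success rate 0.25) keeps most of its advantage, while the same rollout in a mostly succeeding group (`[1, 1, 1, 0]`, success rate 0.75) is down-weighted by `(0.25)**2`, matching the abstract's claim that updates on high-success prompts are suppressed at no extra sampling cost.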
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning with Verifiable Rewards
group sampling
rare-correct trajectories
policy bias
advantage estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Focal loss
difficulty-aware advantage scaling
rare-correct trajectories
group-relative policy optimization
RLVR