🤖 AI Summary
In real-world code-editing tasks, reinforcement learning post-training often suffers from skewed reward distributions and outlier interference, which distort advantage estimation and destabilize policy optimization. To address this, we propose GAPO, a robust advantage estimation method built on adaptive highest-density interval (HDI) sampling: it filters trajectories through the adaptive HDI and computes the Q-value from the median of the retained rewards rather than the group mean. GAPO is critic-free, plug-and-play, and combines three synergistic mechanisms: grouped relative advantage estimation, adaptive outlier filtering, and median-centered value aggregation. Evaluated on 51,844 real-world editing tasks spanning 10 programming languages, GAPO consistently improves exact-match accuracy across nine models from 3B to 14B parameters, significantly outperforming both GRPO and its variant DAPO.
📝 Abstract
Reinforcement learning (RL) is widely used for post-training large language models (LLMs) in code editing, where group-relative methods like GRPO are popular for their critic-free, normalized advantage estimation. However, in real-world code-editing scenarios, reward distributions are often skewed with unpredictable outliers, leading to distorted advantage computation and increased noise. To address this issue, we propose Group Adaptive Policy Optimization (GAPO), which adaptively finds an outlier-free highest-density interval (HDI) per prompt and then uses the median of that interval as an adaptive Q to replace the group mean in advantage calculation. This adaptive Q robustly handles skewed distributions while remaining plug-and-play and efficient. We validate GAPO on nine instruction-tuned LLMs (3B-14B) using a large internal dataset of 51,844 real-world, history-aware code-editing tasks across 10 programming languages, demonstrating consistent improvements in exact-match accuracy over GRPO and its variant DAPO. Code is publicly available.
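The core idea above can be sketched in a few lines. This is an illustrative approximation, not the paper's implementation: it uses a fixed HDI mass, whereas GAPO selects the interval adaptively per prompt, and the names `hdi`, `gapo_advantages`, and `mass` are hypothetical.

```python
import numpy as np

def hdi(rewards, mass=0.75):
    """Shortest interval containing `mass` of the sorted group rewards."""
    r = np.sort(np.asarray(rewards, dtype=float))
    n = len(r)
    k = max(int(np.ceil(mass * n)), 2)          # samples the interval must cover
    widths = r[k - 1:] - r[: n - k + 1]         # width of every candidate window
    i = int(np.argmin(widths))                  # narrowest window = highest density
    return r[i], r[i + k - 1]

def gapo_advantages(rewards, mass=0.75):
    """Group-relative advantages with the HDI median as an adaptive Q
    replacing the group mean used by GRPO."""
    r = np.asarray(rewards, dtype=float)
    lo, hi = hdi(r, mass)
    inliers = r[(r >= lo) & (r <= hi)]          # drop rewards outside the HDI
    q = np.median(inliers)                      # robust baseline (adaptive Q)
    return (r - q) / (r.std() + 1e-8)           # GRPO-style normalization
```

For a skewed group such as `[0, 0, 1, 1, 10]`, the group mean (2.4) is dragged up by the outlier, while the HDI median (0.5) stays with the bulk of the rewards, so the single outlier no longer flips the sign of the other trajectories' advantages.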