🤖 AI Summary
In real-world code-editing tasks, reinforcement learning post-training often suffers from skewed reward distributions and outlier interference, which distort advantage estimation and destabilize policy optimization. To address this, we propose GAPO, a robust advantage estimation method built on adaptive highest-density interval (HDI) sampling: it filters trajectories through the adaptive HDI and computes the Q-value from the median of the retained rewards rather than the group mean. GAPO is critic-free, plug-and-play, and combines three synergistic mechanisms: grouped relative advantage estimation, adaptive outlier filtering, and median-centered value aggregation. Evaluated on 51,844 real-world editing tasks spanning 10 programming languages, GAPO consistently improves exact-match accuracy across nine models from 3B to 14B parameters, significantly outperforming both GRPO and its variant DAPO.
📝 Abstract
Reinforcement learning (RL) is widely used for post-training large language models (LLMs) in code editing, where group-relative methods like GRPO are popular for their critic-free, normalized advantage estimation. However, in real-world code-editing scenarios, reward distributions are often skewed with unpredictable outliers, leading to distorted advantage computation and increased noise. To address this issue, we propose Group Adaptive Policy Optimization (GAPO), which adaptively finds an outlier-free highest-density interval (HDI) per prompt and then uses the median of that interval as an adaptive Q to replace the group mean in advantage calculation. This adaptive Q robustly handles skewed distributions while remaining plug-and-play and efficient. We validate GAPO on nine instruction-tuned LLMs (3B-14B) using a large internal dataset of 51,844 real-world, history-aware code-editing tasks across 10 programming languages, demonstrating consistent improvements in exact-match accuracy over GRPO and its variant DAPO. Code is publicly available.
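The core idea above can be sketched in a few lines. This is an illustrative approximation, not the paper's implementation: it uses a fixed HDI mass, whereas GAPO selects the interval adaptively per prompt, and the names `hdi`, `gapo_advantages`, and `mass` are hypothetical.

```python
import numpy as np

def hdi(rewards, mass=0.75):
    """Shortest interval containing `mass` of the sorted group rewards."""
    r = np.sort(np.asarray(rewards, dtype=float))
    n = len(r)
    k = max(int(np.ceil(mass * n)), 2)          # samples the interval must cover
    widths = r[k - 1:] - r[: n - k + 1]         # width of every candidate window
    i = int(np.argmin(widths))                  # narrowest window = highest density
    return r[i], r[i + k - 1]

def gapo_advantages(rewards, mass=0.75):
    """Group-relative advantages with the HDI median as an adaptive Q
    replacing the group mean used by GRPO."""
    r = np.asarray(rewards, dtype=float)
    lo, hi = hdi(r, mass)
    inliers = r[(r >= lo) & (r <= hi)]          # drop rewards outside the HDI
    q = np.median(inliers)                      # robust baseline (adaptive Q)
    return (r - q) / (r.std() + 1e-8)           # GRPO-style normalization
```

For a skewed group such as `[0, 0, 1, 1, 10]`, the group mean (2.4) is dragged up by the outlier, while the HDI median (0.5) stays with the bulk of the rewards, so the single outlier no longer flips the sign of the other trajectories' advantages.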