AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback

📅 2026-05-20

🤖 AI Summary
This work addresses the training instability and heavy hyperparameter-tuning burden of reinforcement learning for large language models with PPO/GRPO, which stem from fixed clipping thresholds and static decoding temperatures. The authors propose Adaptive Group Policy Optimization (AGPO), a critic-free (value-network-free) refinement of GRPO that uses group-level statistics, including reward dispersion and skewness, probe vote entropy, policy entropy, and step-wise KL drift, to build a shared probe-derived statistical state. This state drives an adaptive clipping mechanism and a bidirectional temperature controller, which together adjust the policy update magnitude and the exploration intensity during training. On nine English and Chinese math and STEM benchmarks, Qwen2.5-14B trained with AGPO outperforms PPO and GRPO under the same generated-token budget, reaching 67.3% on GSM8K and 40.5% on MATH; the gains transfer to Llama-3-8B and Gemma-2-9B.
📝 Abstract
Reinforcement learning improves LLM reasoning, but PPO/GRPO typically use fixed clipping and decoding temperature, which makes training brittle and tuning-heavy. We propose Adaptive Group Policy Optimization (AGPO), a critic-free refinement of GRPO that uses group-level statistics to control both update magnitude and exploration. AGPO uses a shared probe-derived statistical state to drive two controllers: (i) adaptive clipping, which sets the trust-region size from reward dispersion and skewness, probe vote entropy, policy entropy, and step-wise KL drift; and (ii) bidirectional adaptive temperature sampling, which heats or cools decoding around a base temperature according to centered uncertainty relative to a running baseline. On nine English and Chinese math/STEM benchmarks, Qwen2.5-14B trained with AGPO outperforms PPO/GRPO under the same generated-token budget, reaching 67.3% on GSM8K and 40.5% on MATH. Gains transfer to Llama-3-8B and Gemma-2-9B, and ablations confirm both modules are complementary. Our implementation is publicly available at https://github.com/wandugu/paper_agpo.
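To make the two controllers described in the abstract more concrete, here is a minimal Python sketch. It is not the authors' implementation: the function and class names, every constant, the functional forms, and the direction of each adjustment (widening the clip range with reward dispersion, shrinking it under high KL drift, heating when uncertainty rises above its running baseline) are all illustrative assumptions. The sketch also uses only a subset of the signals listed in the abstract; skewness and policy entropy are omitted.

```python
import math
from collections import Counter
from statistics import pstdev


def vote_entropy(answers):
    """Shannon entropy (in nats) of the answer distribution across probe samples."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def adaptive_clip(rewards, kl_drift, base_eps=0.2, eps_min=0.05, eps_max=0.4,
                  kl_target=0.02):
    """Hypothetical clipping rule (assumption, not the paper's formula).

    Widens the clip range when group rewards are dispersed and shrinks it
    when step-wise KL drift exceeds a target, then clamps to a safe interval.
    """
    spread = pstdev(rewards)  # reward dispersion within the group
    kl_penalty = 1.0 + max(0.0, kl_drift / kl_target - 1.0)
    eps = base_eps * (0.5 + 2.0 * spread) / kl_penalty
    return min(eps_max, max(eps_min, eps))


class TemperatureController:
    """Hypothetical bidirectional controller around a base temperature.

    Uncertainty is centered against an exponential running baseline; the
    assumed sign convention heats decoding when uncertainty is above the
    baseline and cools it when below. tanh bounds the shift.
    """

    def __init__(self, base_temp=1.0, gain=0.5, max_shift=0.3, momentum=0.9):
        self.base_temp = base_temp
        self.gain = gain
        self.max_shift = max_shift
        self.momentum = momentum
        self.baseline = None  # running baseline of the uncertainty signal

    def step(self, uncertainty):
        if self.baseline is None:
            self.baseline = uncertainty  # first call: no deviation yet
        centered = uncertainty - self.baseline
        shift = self.max_shift * math.tanh(self.gain * centered)
        # Update the baseline after computing the temperature for this step.
        self.baseline = (self.momentum * self.baseline
                         + (1.0 - self.momentum) * uncertainty)
        return self.base_temp + shift


if __name__ == "__main__":
    group_rewards = [0.0, 1.0, 0.0, 1.0]          # toy binary rewards for one group
    probe_answers = ["42", "42", "17", "42"]      # toy probe votes
    eps = adaptive_clip(group_rewards, kl_drift=0.01)
    controller = TemperatureController()
    temp = controller.step(vote_entropy(probe_answers))
    print(f"clip range: {eps:.3f}, temperature: {temp:.3f}")
```

In a training loop, both values would be recomputed from the same probe statistics at each step, which is one way to read the abstract's "shared probe-derived statistical state"; the actual state and update rules are defined in the paper and its repository.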
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
LLM reasoning
fixed clipping
decoding temperature
training brittleness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Clipping
Adaptive Temperature Sampling
Group Policy Optimization
Statistical Feedback
Critic-Free RL