🤖 AI Summary
This work addresses the training instability and heavy hyperparameter-tuning burden of reinforcement learning for large language models with PPO/GRPO, which stem from fixed clipping thresholds and static decoding temperatures. The authors propose Adaptive Group Policy Optimization (AGPO), a critic-free (value-network-free) refinement of GRPO that uses group-level statistics, such as reward dispersion, entropy, and KL drift, to construct a shared probe-derived statistical state. This state drives an adaptive clipping mechanism and a bidirectional temperature controller, which modulate policy update magnitude and exploration intensity during training. On nine English and Chinese math and STEM benchmarks, Qwen2.5-14B trained with AGPO outperforms PPO and GRPO under the same generated-token budget, reaching 67.3% on GSM8K and 40.5% on MATH, and the gains transfer to Llama-3-8B and Gemma-2-9B.
📝 Abstract
Reinforcement learning improves LLM reasoning, but PPO/GRPO typically use fixed clipping and decoding temperature, which makes training brittle and tuning-heavy. We propose Adaptive Group Policy Optimization (AGPO), a critic-free refinement of GRPO that uses group-level statistics to control both update magnitude and exploration. AGPO uses a shared probe-derived statistical state to drive two controllers: (i) adaptive clipping, which sets the trust-region size from reward dispersion and skewness, probe vote entropy, policy entropy, and step-wise KL drift; and (ii) bidirectional adaptive temperature sampling, which heats or cools decoding around a base temperature according to centered uncertainty relative to a running baseline. On nine English and Chinese math/STEM benchmarks, Qwen2.5-14B trained with AGPO outperforms PPO/GRPO under the same generated-token budget, reaching 67.3% on GSM8K and 40.5% on MATH. Gains transfer to Llama-3-8B and Gemma-2-9B, and ablations confirm both modules are complementary. Our implementation is publicly available at https://github.com/wandugu/paper_agpo.
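To make the two controllers described in the abstract more concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not give the formulas, so the functional forms (a `tanh` squashing for temperature, a ratio-based scale for clipping), the direction of each effect, all coefficient values, and the names `TemperatureController` and `adaptive_clip` are assumptions chosen for illustration. The clipping sketch also uses only two of the signals the abstract lists (reward dispersion and step-wise KL drift).

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class TemperatureController:
    """Hypothetical bidirectional temperature controller (illustrative only)."""
    base_temp: float = 1.0    # temperature the controller moves around
    gain: float = 0.5         # assumed sensitivity; its sign sets the direction
    max_shift: float = 0.3    # assumed bound on the deviation from base_temp
    momentum: float = 0.9     # assumed EMA factor for the running baseline
    baseline: Optional[float] = None

    def step(self, uncertainty: float) -> float:
        # Initialise the running baseline with the first observation.
        if self.baseline is None:
            self.baseline = uncertainty
        # Centre the uncertainty signal on the running baseline.
        centered = uncertainty - self.baseline
        # Update the baseline as an exponential moving average.
        self.baseline = (self.momentum * self.baseline
                         + (1.0 - self.momentum) * uncertainty)
        # Assumption: above-baseline uncertainty heats decoding and
        # below-baseline uncertainty cools it; a negative gain flips this.
        return self.base_temp + self.max_shift * math.tanh(self.gain * centered)


def adaptive_clip(base_eps: float, reward_std: float, kl_drift: float,
                  k_std: float = 0.5, k_kl: float = 5.0,
                  lo: float = 0.05, hi: float = 0.4) -> float:
    """Hypothetical adaptive clip range (illustrative only).

    Assumption: more reward dispersion widens the trust region and more
    KL drift narrows it; the result is clamped to [lo, hi].
    """
    scale = (1.0 + k_std * reward_std) / (1.0 + k_kl * kl_drift)
    return min(hi, max(lo, base_eps * scale))


if __name__ == "__main__":
    ctrl = TemperatureController()
    for u in (1.0, 2.0, 0.0):
        print(f"uncertainty={u:.1f} -> temperature={ctrl.step(u):.3f}")
    print(f"clip range: {adaptive_clip(0.2, reward_std=0.3, kl_drift=0.02):.3f}")
```

Both functions keep their outputs bounded (temperature within `base_temp ± max_shift`, clip range within `[lo, hi]`), which is one simple way to prevent an adaptive controller from destabilising training; the released code at the repository above should be consulted for the actual rules.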