Hölder Policy Optimisation

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing policy optimisation methods rely on fixed aggregation mechanisms that struggle to balance gradient concentration and variance stability, often leading to training instability or performance plateaus. This work proposes the first general policy optimisation framework based on Hölder averaging, which continuously modulates the trade-off between gradient concentration and variance through an adjustable parameter $p$. A dynamic annealing algorithm is introduced to schedule $p$ progressively during training, thereby unifying gradient concentration and stability and overcoming the limitations of static aggregation. By integrating trajectory-level advantage estimation with token-level probability aggregation, the method achieves a state-of-the-art average accuracy of 54.9% across multiple mathematical reasoning benchmarks, a 7.2% relative improvement over standard GRPO, and attains a 93.8% success rate on ALFWorld tasks.
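
For reference, the standard Hölder (power) mean that the summary refers to can be written as follows. This is the textbook definition, not a formula quoted from the paper; taking the inputs $x_1, \dots, x_n$ to be the token-level probabilities of one sequence is an assumption made here for illustration, and the paper's exact objective may differ.

$$
M_p(x_1, \dots, x_n) = \left( \frac{1}{n} \sum_{i=1}^{n} x_i^{\,p} \right)^{1/p} \quad (p \neq 0), \qquad M_0(x_1, \dots, x_n) = \left( \prod_{i=1}^{n} x_i \right)^{1/n}
$$

For positive inputs, $p = 1$ gives the arithmetic mean, $p \to 0$ the geometric mean, and $p = -1$ the harmonic mean, while $p \to +\infty$ and $p \to -\infty$ approach the maximum and minimum. $M_p$ is non-decreasing in $p$, so a single scalar moves the aggregate continuously between these cases.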

📝 Abstract

Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence. Relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm's adaptability. Empirically, we observe a critical trade-off: certain fixed aggregations frequently suffer from training collapse, while others fail to yield satisfactory performance. To resolve this, we propose HölderPO, a generalised policy optimisation framework unifying token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter $p$, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove that a larger $p$ concentrates the gradient to amplify sparse learning signals, whereas a smaller $p$ strictly bounds gradient variance. Because no static configuration can universally resolve this concentration-stability trade-off, we instantiate the framework with a dynamic annealing algorithm that progressively schedules $p$ across the training lifecycle. Extensive evaluations demonstrate superior stability and convergence over existing baselines. Specifically, our approach achieves a state-of-the-art average accuracy of $54.9\%$ across multiple mathematical benchmarks, yielding a substantial $7.2\%$ relative gain over standard GRPO, and secures an exceptional $93.8\%$ success rate on ALFWorld.
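
To make the aggregation step concrete, here is a minimal Python sketch of a Hölder (power) mean over the token probabilities of one sequence, together with a simple schedule for $p$. It is an illustration under stated assumptions, not the authors' implementation: the function names `holder_mean` and `linear_p_schedule` are hypothetical, the linear shape of the schedule is assumed, and the abstract does not state the direction or endpoints of the annealing, so those are left as parameters.

```python
import math


def holder_mean(probs, p):
    """Standard power (Hölder) mean of strictly positive values.

    p == 0 is handled as the geometric-mean limit.
    """
    if not probs or any(x <= 0 for x in probs):
        raise ValueError("probs must be non-empty and strictly positive")
    n = len(probs)
    if abs(p) < 1e-12:
        # Limit p -> 0: exp of the mean log, i.e. the geometric mean.
        return math.exp(sum(math.log(x) for x in probs) / n)
    return (sum(x ** p for x in probs) / n) ** (1.0 / p)


def linear_p_schedule(step, total_steps, p_start, p_end):
    """Hypothetical schedule: linear interpolation from p_start to p_end.

    The shape and direction are assumptions; the source only says that
    p is scheduled progressively over training.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


if __name__ == "__main__":
    # Illustrative token probabilities for one sampled sequence (made up).
    token_probs = [0.9, 0.8, 0.05, 0.7]
    for step in (0, 50, 100):
        # Endpoints 2.0 and -1.0 are arbitrary illustrative values.
        p = linear_p_schedule(step, 100, p_start=2.0, p_end=-1.0)
        print(f"step={step} p={p:+.2f} aggregate={holder_mean(token_probs, p):.4f}")
```

Because the power mean is non-decreasing in $p$, a single low-probability token pulls the aggregate down more strongly as $p$ decreases. That gives one intuition for how a scalar can change which tokens dominate the aggregated quantity.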

Problem

Research questions and friction points this paper is trying to address.

policy optimisation

advantage aggregation

gradient variance

training stability

token-level probability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hölder Policy Optimisation

token-level aggregation

gradient variance control