🤖 AI Summary
This work addresses the high cost and slow iteration of manually designing policy optimization algorithms for language models, a process that lacks automated support for algorithm-level innovation. To this end, we propose POISE, the first closed-loop framework that autonomously discovers and iteratively improves policy optimization algorithms. POISE integrates algorithmic proposals, executable code, standardized reinforcement learning evaluations, and natural-language reflections into a structured "genetic" archive, enabling cross-iteration knowledge reuse and interpretable design. Starting from GRPO, POISE automatically discovers novel mechanisms that improve the weighted Overall score by 4.6 points and significantly boost AIME25 pass@32 from 26.7% to 43.3% on mathematical reasoning tasks.
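The "genetic" archive described above can be pictured as a set of genealogically linked records. The following is a minimal sketch under assumptions: the summary only states that each entry links a proposal, executable code, standardized evaluations, and a reflection, so all field names, the `ArchiveEntry` class, and the `lineage_depth` helper are illustrative, not the authors' actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ArchiveEntry:
    """Hypothetical record in a POISE-style genealogical archive."""
    entry_id: int
    parent_id: Optional[int]       # genealogical link to the entry it was derived from
    proposal: str                  # natural-language description of the mechanism
    code: str                      # executable implementation of the candidate algorithm
    evaluation: dict = field(default_factory=dict)  # standardized RL benchmark scores
    reflection: str = ""           # natural-language analysis of the outcome

    def lineage_depth(self, archive: dict) -> int:
        """Count ancestors back to the seed algorithm (e.g. GRPO)."""
        depth, node = 0, self
        while node.parent_id is not None:
            node = archive[node.parent_id]
            depth += 1
        return depth


# Usage: a seed entry and one discovered variant (scores from the abstract).
archive = {}
seed = ArchiveEntry(0, None, "GRPO baseline", "...", {"Overall": 47.8})
child = ArchiveEntry(1, 0, "GRPO + validity masking", "...", {"Overall": 52.5})
archive[0], archive[1] = seed, child
print(child.lineage_depth(archive))  # → 1
```

Storing the parent link alongside the reflection is what would allow cross-iteration knowledge reuse: a new proposal can be conditioned on the full chain of prior evidence rather than on a single best candidate.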
📝 Abstract
Discovering improved policy optimization algorithms for language models remains a costly manual process requiring repeated mechanism-level modification and validation. Unlike simple combinatorial code search, this problem requires searching over algorithmic mechanisms tightly coupled with training dynamics while reusing empirical evidence across iterations. We propose POISE, a closed-loop framework for the automated discovery of policy optimization algorithms for language models. POISE maintains a structured, genealogically organized archive that links proposals, executable implementations, standardized evaluations, and natural-language reflections to support evidence-driven iteration. In mathematical reasoning experiments starting from GRPO, POISE evaluates 64 candidate algorithms and discovers improved mechanisms, including analytic-variance scaling and validity masking. The best variant improves weighted Overall from 47.8 to 52.5 (+4.6) and increases AIME25 pass@32 from 26.7% to 43.3%, demonstrating the feasibility of automated policy optimization discovery while yielding interpretable design principles.