🤖 AI Summary
This work addresses the high cost and slow iteration of manually designing policy optimization algorithms for language models, a process that lacks automated support for algorithm-level innovation. To this end, we propose POISE, the first closed-loop framework that autonomously discovers and iteratively improves policy optimization algorithms. POISE integrates algorithmic proposals, executable code, standardized reinforcement learning evaluations, and natural-language reflections into a structured "genetic" archive, enabling cross-iteration knowledge reuse and interpretable design. Starting from GRPO, POISE automatically discovers novel mechanisms that improve the weighted Overall score by 4.6 points and significantly boost AIME25 pass@32 from 26.7% to 43.3% on mathematical reasoning tasks.
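The "genetic" archive described above can be pictured as a set of genealogically linked records. The following is a minimal sketch under assumptions: the summary only states that each entry links a proposal, executable code, standardized evaluations, and a reflection, so all field names, the `ArchiveEntry` class, and the `lineage_depth` helper are illustrative, not the authors' actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ArchiveEntry:
    """Hypothetical record in a POISE-style genealogical archive."""
    entry_id: int
    parent_id: Optional[int]       # genealogical link to the entry it was derived from
    proposal: str                  # natural-language description of the mechanism
    code: str                      # executable implementation of the candidate algorithm
    evaluation: dict = field(default_factory=dict)  # standardized RL benchmark scores
    reflection: str = ""           # natural-language analysis of the outcome

    def lineage_depth(self, archive: dict) -> int:
        """Count ancestors back to the seed algorithm (e.g. GRPO)."""
        depth, node = 0, self
        while node.parent_id is not None:
            node = archive[node.parent_id]
            depth += 1
        return depth


# Usage: a seed entry and one discovered variant (scores from the abstract).
archive = {}
seed = ArchiveEntry(0, None, "GRPO baseline", "...", {"Overall": 47.8})
child = ArchiveEntry(1, 0, "GRPO + validity masking", "...", {"Overall": 52.5})
archive[0], archive[1] = seed, child
print(child.lineage_depth(archive))  # → 1
```

Storing the parent link alongside the reflection is what would allow cross-iteration knowledge reuse: a new proposal can be conditioned on the full chain of prior evidence rather than on a single best candidate.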
📝 Abstract
Discovering improved policy optimization algorithms for language models remains a costly manual process requiring repeated mechanism-level modification and validation. Unlike simple combinatorial code search, this problem requires searching over algorithmic mechanisms tightly coupled with training dynamics while reusing empirical evidence across iterations. We propose POISE, a closed-loop framework for the automated discovery of policy optimization algorithms for language models. POISE maintains a structured, genealogically organized archive that links proposals, executable implementations, standardized evaluations, and natural-language reflections to support evidence-driven iteration. In mathematical reasoning experiments starting from GRPO, POISE evaluates 64 candidate algorithms and discovers improved mechanisms, including analytic-variance scaling and validity masking. The best variant improves weighted Overall from 47.8 to 52.5 (+4.6) and increases AIME25 pass@32 from 26.7% to 43.3%, demonstrating the feasibility of automated policy optimization discovery while yielding interpretable design principles.