From AI Assistant to AI Scientist: Autonomous Discovery of LLM-RL Algorithms with LLM Agents

πŸ“… 2026-03-25
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the high cost and slow iteration inherent in manually designing policy optimization algorithms for language models, a process that lacks automated support for algorithmic-level innovation. To this end, the authors propose POISE, the first closed-loop framework that autonomously discovers and iteratively improves policy optimization algorithms. POISE integrates algorithmic proposals, executable code, standardized reinforcement learning evaluations, and natural-language reflections into a structured "genetic" archive, enabling cross-iteration knowledge reuse and interpretable design. Starting from GRPO, POISE automatically discovers novel mechanisms that improve the weighted Overall score by 4.6 points and boost AIME25 pass@32 from 26.7% to 43.3% on mathematical reasoning tasks.

πŸ“ Abstract
Discovering improved policy optimization algorithms for language models remains a costly manual process requiring repeated mechanism-level modification and validation. Unlike simple combinatorial code search, this problem requires searching over algorithmic mechanisms tightly coupled with training dynamics while reusing empirical evidence across iterations. We propose POISE, a closed-loop framework for the automated discovery of policy optimization algorithms for language models. POISE maintains a structured, genealogically linked archive that connects proposals, executable implementations, standardized evaluations, and natural-language reflections to support evidence-driven iteration. In mathematical reasoning experiments starting from GRPO, POISE evaluates 64 candidate algorithms and discovers improved mechanisms, including analytic-variance scaling and validity masking. The best variant improves the weighted Overall score from 47.8 to 52.5 (+4.6) and increases AIME25 pass@32 from 26.7% to 43.3%, demonstrating the feasibility of automated policy optimization discovery while yielding interpretable design principles.
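The abstract describes each archive entry as linking a proposal, an executable implementation, a standardized evaluation, and a natural-language reflection, with genealogical links between candidates. The paper does not publish its data model, so the sketch below is a minimal, hypothetical Python rendering of such an archive; all class, field, and function names are assumptions, and the scores are the paper's reported Overall numbers used purely as illustrative placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CandidateRecord:
    """One entry in the 'genetic' archive: proposal, code, result, reflection."""
    candidate_id: int
    proposal: str                        # natural-language mechanism description
    implementation: str                  # executable code for the variant
    evaluation: Optional[float] = None   # standardized RL benchmark score
    reflection: str = ""                 # natural-language analysis of the result
    parent_id: Optional[int] = None      # genealogical link to the parent/seed


class GeneticArchive:
    """Structured archive supporting cross-iteration knowledge reuse."""

    def __init__(self) -> None:
        self.records: List[CandidateRecord] = []

    def add(self, proposal: str, implementation: str,
            parent_id: Optional[int] = None) -> CandidateRecord:
        rec = CandidateRecord(len(self.records), proposal, implementation,
                              parent_id=parent_id)
        self.records.append(rec)
        return rec

    def record_result(self, candidate_id: int, score: float,
                      reflection: str) -> None:
        rec = self.records[candidate_id]
        rec.evaluation = score
        rec.reflection = reflection

    def best(self) -> CandidateRecord:
        return max((r for r in self.records if r.evaluation is not None),
                   key=lambda r: r.evaluation)

    def lineage(self, candidate_id: int) -> List[int]:
        # Walk parent links back to the seed algorithm (e.g., GRPO).
        chain = []
        cur: Optional[CandidateRecord] = self.records[candidate_id]
        while cur is not None:
            chain.append(cur.candidate_id)
            cur = (self.records[cur.parent_id]
                   if cur.parent_id is not None else None)
        return chain[::-1]


# Illustrative run: seed with GRPO, then log one discovered variant.
archive = GeneticArchive()
seed = archive.add("GRPO baseline", "def grpo_update(batch): ...")
archive.record_result(seed.candidate_id, 47.8, "baseline score")
child = archive.add("GRPO + validity masking", "def grpo_vm_update(batch): ...",
                    parent_id=seed.candidate_id)
archive.record_result(child.candidate_id, 52.5,
                      "masking invalid samples improves Overall")
print(archive.best().proposal)              # GRPO + validity masking
print(archive.lineage(child.candidate_id))  # [0, 1]
```

The genealogical `parent_id` links are what make the archive "genetic": each new proposal can be traced back to the seed, and reflections attached along the lineage give the LLM agent reusable evidence for the next iteration.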
Problem

Research questions and friction points this paper is trying to address.

policy optimization
language models
algorithm discovery
autonomous discovery
LLM-RL
Innovation

Methods, ideas, or system contributions that make the work stand out.

automated algorithm discovery
policy optimization
LLM agents
evidence-driven iteration
interpretable RL algorithms
πŸ”Ž Similar Papers
No similar papers found.
Sirui Xia
Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University
Yikai Zhang
Fudan University
Natural Language Processing, Autonomous Agent
Aili Chen
Fudan University
Large Language Model, Reasoning and Planning, Language Agent, LLM Personalization
Siye Wu
Fudan University
Siyu Yuan
School of Data Science, Fudan University
Yanghua Xiao
Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University