Scaling Self-Play with Self-Guidance

📅 2026-04-22

🤖 AI Summary

Existing self-play approaches for large language models often plateau during extended training because the problem generator (the Conjecturer) drifts toward artificially complex questions that do not help the solver improve. This work proposes Self-Guided Self-Play (SGS), a framework in which the guidance comes from the model itself. In SGS, the language model simultaneously assumes three roles: Solver, Conjecturer (problem generator), and Guide. The Guide scores synthesized problems by their relevance to unsolved target problems and by how clean and natural they are, which suppresses reward hacking and problem degradation. Evaluated on formal theorem proving in Lean4, SGS surpasses the asymptotic solve rate of the strongest reinforcement learning baseline in fewer than 80 self-play rounds. After 200 rounds of self-play, a 7B-parameter model trained with SGS solves more problems than a 671B-parameter model does at pass@4.

📝 Abstract

LLM self-play algorithms are notable in that, in principle, nothing bounds their learning: a Conjecturer model creates problems for a Solver, and both improve together. However, in practice, existing LLM self-play methods do not scale well with large amounts of compute, instead hitting learning plateaus. We argue this is because over long training runs, the Conjecturer learns to hack its reward, collapsing to artificially complex problems that do not help the Solver improve. To overcome this, we introduce Self-Guided Self-Play (SGS), a self-play algorithm in which the language model itself guides the Conjecturer away from degeneracy. In SGS, the model takes on three roles: Solver, Conjecturer, and a Guide that scores synthetic problems by their relevance to unsolved target problems and how clean and natural they are, providing supervision against Conjecturer collapse. Our core hypothesis is that language models can assess whether a subproblem is useful for achieving a goal. We evaluate the scaling properties of SGS by running training for significantly longer than prior works and by fitting scaling laws to cumulative solve rate curves. Applying SGS to formal theorem proving in Lean4, we find that it surpasses the asymptotic solve rate of our strongest RL baseline in fewer than 80 rounds of self-play and enables a 7B parameter model, after 200 rounds of self-play, to solve more problems than a 671B parameter model pass@4.
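
The three-role loop described in the abstract can be pictured with a small sketch. It is illustrative only and is not the paper's implementation: the abstract does not give the reward formula, so the `conjecturer_reward` function below (a difficulty term that peaks at a 50% solve rate, multiplied by the Guide's score) is an assumption, and the `conjecture`, `solve`, and `guide` callables are hypothetical stand-ins for calls to the same language model in its Conjecturer, Solver, and Guide roles.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScoredProblem:
    statement: str
    solve_rate: float   # fraction of Solver attempts that succeeded
    guide_score: float  # Guide's relevance/naturalness score, assumed in [0, 1]
    reward: float       # resulting Conjecturer reward


def conjecturer_reward(solve_rate: float, guide_score: float) -> float:
    """Hypothetical reward: intermediate difficulty, scaled by the Guide's score."""
    # Assumed difficulty term: 1.0 at a 50% solve rate, 0.0 if always or never solved.
    difficulty = 4.0 * solve_rate * (1.0 - solve_rate)
    # A low Guide score shrinks the reward, so hard-but-unnatural problems pay little.
    return difficulty * guide_score


def self_play_round(
    target: str,
    conjecture: Callable[[str], str],    # Conjecturer role: target -> synthetic problem
    solve: Callable[[str], bool],        # Solver role: problem -> verified success?
    guide: Callable[[str, str], float],  # Guide role: (problem, target) -> score
    n_problems: int = 4,
    n_attempts: int = 8,
) -> List[ScoredProblem]:
    """Run one round: propose problems, attempt them, and score them with the Guide."""
    scored = []
    for _ in range(n_problems):
        problem = conjecture(target)
        successes = sum(1 for _ in range(n_attempts) if solve(problem))
        solve_rate = successes / n_attempts
        g = guide(problem, target)
        scored.append(
            ScoredProblem(problem, solve_rate, g, conjecturer_reward(solve_rate, g))
        )
    return scored


if __name__ == "__main__":
    # Toy stand-ins; a real system would query the model and a Lean4 checker.
    results = self_play_round(
        target="unsolved target theorem",
        conjecture=lambda t: "auxiliary lemma toward: " + t,
        solve=lambda p: len(p) % 2 == 0,
        guide=lambda p, t: 0.8,
    )
    for r in results:
        print(r)
```

In this sketch, a problem that is never solved or always solved earns zero reward regardless of the Guide, while a problem of intermediate difficulty earns reward only in proportion to the Guide's score. That is one simple way to express the idea that the Guide steers the Conjecturer away from artificially complex problems.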

Problem

Research questions and friction points this paper is trying to address.

self-play

scaling

reward hacking

language models

learning plateau

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Guided Self-Play

LLM self-play

reward hacking