🤖 AI Summary
Existing asymmetric self-play approaches generate questions lacking goal-directedness, limiting their effectiveness in enhancing model capabilities. This work proposes a goal-guided self-play framework that leverages "goalposts"—challenging problems derived from real-world data—to steer the teacher model in producing a curriculum of progressively harder questions, thereby establishing a clear and structured learning trajectory. By integrating asymmetric self-play with controllable difficulty in question generation, the method substantially improves both training efficiency and model generalization. Evaluated on LiveCodeBench, the approach achieves a 2.5% absolute improvement in pass@20 over unguided asymmetric self-play and solves multiple high-difficulty programming problems that baseline models fail to solve.
📝 Abstract
Asymmetric self-play has emerged as a promising paradigm for post-training large language models, where a teacher continually generates questions for a student to solve at the edge of the student's learnability. Although these methods promise open-ended data generation bootstrapped without human data, they suffer from one major problem: not all problems that are hard to solve are interesting or informative for improving the overall capabilities of the model. Current asymmetric self-play methods are goal-agnostic, with no grounding in real data. We propose Guided Asymmetric Self-Play (GASP), where grounding is provided by real-data goalpost questions that are identified to pose a hard exploration challenge to the model. During self-play, the teacher first generates an easier variant of a hard question, and then a harder variant of that easier question, with the goal of gradually closing the gap to the goalpost throughout training. In doing so, we improve pass@20 on LiveCodeBench (LCB) by 2.5% over unguided asymmetric self-play, and through the curriculum constructed by the teacher, we manage to solve hard goalpost questions that remain out of reach for all baselines.
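The easier-then-harder curriculum loop described in the abstract can be sketched as a toy simulation. Everything here is an illustrative assumption, not the paper's implementation: the real teacher and student are LLMs, whereas this sketch collapses question difficulty to a scalar and models learning as a fixed skill increment per solved question.

```python
from dataclasses import dataclass

@dataclass
class Question:
    text: str
    difficulty: float  # 0.0 = trivial, 1.0 = goalpost-hard (toy scalar stand-in)

# Hypothetical stand-ins for the teacher's two generation modes.
def make_easier(q: Question, step: float = 0.3) -> Question:
    return Question(f"easier({q.text})", max(0.0, q.difficulty - step))

def make_harder(q: Question, step: float = 0.2) -> Question:
    return Question(f"harder({q.text})", min(1.0, q.difficulty + step))

def gasp_round(goalpost: Question, skill: float, learn_rate: float = 0.05):
    """One guided self-play episode: start from an easier variant of the
    goalpost, then ratchet difficulty back up until the student can solve
    a question as hard as the goalpost itself."""
    curriculum = []
    q = make_easier(goalpost)  # teacher first eases off the hard question
    while True:
        curriculum.append(q)
        if q.difficulty <= skill:                 # toy student: solves anything within skill
            skill = min(1.0, skill + learn_rate)  # toy learning signal from each solve
            if q.difficulty >= goalpost.difficulty:
                return skill, curriculum          # gap to the goalpost is closed
            q = make_harder(q)                    # push back toward the goalpost
        else:
            q = make_easier(q)                    # back off to the edge of learnability
```

Because `skill` increases with every solved question and the teacher backs off whenever the student fails, the loop stays near the edge of learnability and is guaranteed to terminate in this toy model; for example, `gasp_round(Question("LCB-hard", 1.0), skill=0.5)` climbs from difficulty 0.7 up to 1.0.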