🤖 AI Summary
The continual advancement of large language models (LLMs) is bottlenecked by the scarcity of high-quality training data.
Method: This paper introduces Language Self-Play (LSP), a framework that combines game-theoretic principles with reinforcement learning by casting a model's language capability as performance in a competitive game played against itself, enabling autonomous improvement without any new external data. Rather than relying on large-scale annotated datasets, it uses Llama-3.2-3B-Instruct as the base model and performs iterative self-play training on instruction-following tasks.
Contribution/Results: Through a closed optimization loop of self-generated instructions and automated evaluation, the method achieves significant gains on challenging instruction-following benchmarks, outperforming data-driven baselines at a comparable model scale. The empirical results demonstrate that sustained LLM capability improvement is feasible in a fully data-free regime.
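The closed loop described above (self-generate a task, attempt it, score it automatically, update) can be sketched as a toy simulation. Everything here is a hypothetical stand-in: the function names, the scalar `skill` variable, and the reward-scaled update are illustrative assumptions, since the actual method prompts one LLM in both challenger and solver roles and updates its weights with reinforcement learning.

```python
import math
import random

def challenger(skill, rng):
    """Hypothetical: propose a task slightly harder, on average, than current skill."""
    return skill + rng.uniform(-0.1, 0.3)

def solver(skill, difficulty, rng):
    """Hypothetical: success probability rises with the skill-vs-difficulty margin."""
    return rng.random() < 1.0 / (1.0 + math.exp(-(skill - difficulty)))

def judge(success):
    """Automated evaluation: reward 1.0 for solving the self-generated task."""
    return 1.0 if success else 0.0

def language_self_play(steps=500, lr=0.05, seed=0):
    """Run the closed loop: generate task -> attempt -> score -> update."""
    rng = random.Random(seed)
    skill = 0.0
    for _ in range(steps):
        difficulty = challenger(skill, rng)
        reward = judge(solver(skill, difficulty, rng))
        skill += lr * reward  # scalar stand-in for an RL policy-gradient update
    return skill
```

The point of the sketch is the data-free dynamic: the challenger keeps tasks near the frontier of the solver's ability, so the reward signal never saturates and no external dataset is consumed.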
📝 Abstract
Large language models (LLMs) have advanced rapidly in recent years, driven by scale, abundant high-quality training data, and reinforcement learning. Yet this progress faces a fundamental bottleneck: the need for ever more data from which models can continue to learn. In this work, we propose a reinforcement learning approach that removes this dependency by enabling models to improve without additional data. Our method leverages a game-theoretic framework of self-play, where a model's capabilities are cast as performance in a competitive game and stronger policies emerge by having the model play against itself, a process we call Language Self-Play (LSP). Experiments with Llama-3.2-3B-Instruct on instruction-following benchmarks show that pretrained models can not only enhance their performance on challenging tasks through self-play alone, but can also do so more effectively than data-driven baselines.