🤖 AI Summary
High-quality domain-specific instructions and verifiable reward signals are scarce, which hinders low-data reinforcement learning (RL) for large language models (LLMs). Method: We propose SeRL, a self-play RL framework that requires no external annotations. It combines two modules: (1) a self-instruction module that generates additional instructions from the available data at each training step and applies online filtering for quality, diversity, and difficulty; and (2) a self-rewarding module that estimates rewards by majority voting over multiple sampled responses to the same instruction. Conventional RL is then run on the generated data. Contribution/Results: SeRL removes the dependence on human-annotated instructions and external reward functions, enabling iterative optimization from limited initial data. On multiple reasoning benchmarks and across different LLM backbones, SeRL outperforms its counterparts and performs on par with training on high-quality data with verifiable rewards, which makes it a candidate approach for LLM training in resource-constrained domains.
📝 Abstract
Recent advances have demonstrated the effectiveness of Reinforcement Learning (RL) in improving the reasoning capabilities of Large Language Models (LLMs). However, existing works rely on high-quality instructions and verifiable rewards for effective training, both of which are often difficult to obtain in specialized domains. In this paper, we propose Self-play Reinforcement Learning (SeRL) to bootstrap LLM training with limited initial data. Specifically, SeRL comprises two complementary modules: self-instruction and self-rewarding. The former module generates additional instructions based on the available data at each training step, employing robust online filtering strategies to ensure instruction quality, diversity, and difficulty. The latter module introduces a simple yet effective majority-voting mechanism to estimate response rewards for additional instructions, eliminating the need for external annotations. Finally, SeRL performs conventional RL based on the generated data, facilitating iterative self-play learning. Extensive experiments on various reasoning benchmarks and across different LLM backbones demonstrate that the proposed SeRL yields results superior to its counterparts and achieves performance on par with those obtained by high-quality data with verifiable rewards. Our code is available at https://github.com/wantbook-book/SeRL.
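The majority-voting idea in the abstract can be illustrated with a minimal Python sketch. It is not taken from the SeRL repository: the function names `majority_vote_rewards` and `keep_instruction`, the binary 0/1 reward, and the agreement thresholds are all assumptions made for illustration. The sketch assumes that several responses are sampled for one generated instruction, that a final answer has already been extracted from each, and that the most frequent answer serves as a pseudo-label.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Estimate rewards for sampled answers to one instruction.

    The most frequent answer is treated as the pseudo-label; answers that
    match it get reward 1.0 and all others get 0.0 (an assumed scheme).
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    pseudo_label, top_count = counts.most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    agreement = top_count / len(answers)  # share of samples backing the label
    return pseudo_label, rewards, agreement


def keep_instruction(agreement, low=0.2, high=0.8):
    """Hypothetical difficulty filter based on vote agreement.

    Very high agreement suggests the instruction is too easy; very low
    agreement suggests the pseudo-label is unreliable. Both bounds are
    illustrative values, not settings confirmed by the source.
    """
    return low <= agreement <= high


if __name__ == "__main__":
    sampled = ["42", "42", "41", "42", "7"]
    label, rewards, agreement = majority_vote_rewards(sampled)
    print(label, rewards, agreement, keep_instruction(agreement))
```

The rewards produced this way would be passed to whichever RL algorithm is used for training. The sketch leaves that step out, since the abstract only describes it as conventional RL.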