🤖 AI Summary
This work addresses a key performance bottleneck in large-scale reinforcement learning (RL) for code generation, demonstrating that data diversity and structure, not merely data volume, are the critical limiting factors. The authors propose a multi-turn synthetic data generation framework that requires no fine-tuning of the teacher model: the teacher dynamically adjusts task difficulty based on summaries of student performance, automatically producing easier and harder variants of the same core task to construct structured curricula. Evaluated on the Llama3.1-8B Instruct and Qwen3-8B Base model families, with additional scaling experiments on Qwen2.5-32B, the approach consistently improves in-domain code generation and, in most cases, out-of-domain mathematical reasoning. The study systematically examines how task difficulty, curriculum scheduling, and environment diversity jointly shape RL training dynamics.
📝 Abstract
Reinforcement learning (RL) has emerged as a powerful paradigm for improving large language models beyond supervised fine-tuning, yet sustaining performance gains at scale remains an open challenge, as data diversity and structure, rather than volume alone, become the limiting factor. We address this by introducing a scalable multi-turn synthetic data generation pipeline in which a teacher model iteratively refines problems based on in-context student performance summaries, producing structured difficulty progressions without any teacher fine-tuning. Compared to single-turn generation, this multi-turn approach substantially improves the yield of valid synthetic problems and naturally produces stepping stones, i.e., easier and harder variants of the same core task, that support curriculum-based training. We systematically study how task difficulty, curriculum scheduling, and environment diversity interact during RL training across the Llama3.1-8B Instruct and Qwen3-8B Base model families, with additional scaling experiments on Qwen2.5-32B. Our results show that synthetic augmentation consistently improves in-domain code performance and, in most cases, out-of-domain math performance, and we provide empirical insights into how curriculum design and data diversity jointly shape RL training dynamics.
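The multi-turn refinement loop described in the abstract can be sketched at a very high level. The sketch below is an illustrative toy, not the authors' implementation: `teacher_refine` stands in for prompting a frozen teacher LLM with a student performance summary, and the numeric `difficulty` field, the 0.2/0.8 pass-rate thresholds, and the simulated pass rates are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Problem:
    text: str
    difficulty: int  # hypothetical scale: 1 (easy) .. 5 (hard)

def teacher_refine(problem: Problem, pass_rate: float) -> Problem:
    """Hypothetical stand-in for prompting a frozen teacher model.

    If the student solves the task too easily, request a harder variant;
    if it almost never succeeds, request an easier one. The thresholds
    are illustrative, not taken from the paper.
    """
    if pass_rate > 0.8 and problem.difficulty < 5:
        return Problem(problem.text + " [harder variant]", problem.difficulty + 1)
    if pass_rate < 0.2 and problem.difficulty > 1:
        return Problem(problem.text + " [easier variant]", problem.difficulty - 1)
    return problem  # already in a useful difficulty band

def multi_turn_generation(seed: Problem, pass_rates: List[float]) -> List[Problem]:
    """Run several refinement rounds, collecting every variant
    (the 'stepping stones'), then order them easy-to-hard so they
    can serve as a curriculum for RL training."""
    variants = [seed]
    current = seed
    for rate in pass_rates:  # one simulated student pass rate per round
        current = teacher_refine(current, rate)
        variants.append(current)
    return sorted(variants, key=lambda p: p.difficulty)

# Toy run: student initially succeeds often, then struggles on the hardest variant.
curriculum = multi_turn_generation(
    Problem("Reverse a linked list", difficulty=3),
    pass_rates=[0.9, 0.9, 0.1],
)
print([p.difficulty for p in curriculum])  # easy-to-hard ordering: [3, 4, 4, 5]
```

In this toy trace, two rounds of high pass rates push difficulty up (3 → 4 → 5), a low pass rate then produces an easier variant (back to 4), and the final sort yields the easy-to-hard curriculum the abstract describes.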