🤖 AI Summary
This work addresses the limited scalability and diversity of synthetic reasoning environments in current data-generation methods, which hinders the application of verifier-based reinforcement learning to reasoning language models. To overcome this, the authors propose ReSyn, a framework for large-scale automatic generation of diverse reasoning environments—each comprising an instance generator and a verifier—spanning tasks such as constraint satisfaction, algorithmic puzzles, and spatial reasoning. ReSyn combines procedural environment generation with reinforcement learning from verifiable rewards (RLVR) to form a scalable synthetic-data pipeline, on which a Qwen2.5-7B-Instruct model is trained. Experiments show substantial gains across multiple reasoning benchmarks, including a 27% relative improvement on the challenging BBEH benchmark, confirming the efficacy of environment diversity and verifier-guided supervision.
📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising approach for training reasoning language models (RLMs) by leveraging supervision from verifiers. Although implementing a verifier is easier than annotating solutions for many tasks, existing synthetic data generation methods remain largely solution-centric, while verifier-based methods rely on a few hand-crafted procedural environments. In this work, we scale RLVR by introducing ReSyn, a pipeline that generates diverse reasoning environments equipped with instance generators and verifiers, covering tasks such as constraint satisfaction, algorithmic puzzles, and spatial reasoning. A Qwen2.5-7B-Instruct model trained with RL on ReSyn data achieves consistent gains across reasoning benchmarks and out-of-domain math benchmarks, including a 27% relative improvement on the challenging BBEH benchmark. Ablations show that verifier-based supervision and increased task diversity both contribute significantly, providing empirical evidence that generating reasoning environments at scale can enhance reasoning abilities in RLMs.
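To make the generator/verifier pairing concrete, here is a minimal hypothetical sketch of one reasoning environment in the ReSyn style: a generator samples a small constraint-satisfaction instance, and a verifier checks a candidate answer programmatically, yielding a binary verifiable reward for RLVR. The task, function names, and interface are illustrative assumptions, not the paper's actual implementation.

```python
import random

def generate_instance(rng: random.Random) -> dict:
    """Generator: sample a tiny constraint-satisfaction instance
    (find integer x such that a*x + b == c)."""
    x = rng.randint(1, 20)          # hidden solution
    a = rng.randint(2, 9)
    b = rng.randint(0, 50)
    return {"prompt": f"Solve for integer x: {a}*x + {b} = {a * x + b}",
            "a": a, "b": b, "c": a * x + b}

def verify(instance: dict, answer: str) -> bool:
    """Verifier: check the answer against the constraint, independent of
    how the model derived it."""
    try:
        x = int(answer.strip())
    except ValueError:
        return False
    return instance["a"] * x + instance["b"] == instance["c"]

def reward(instance: dict, answer: str) -> float:
    """Binary verifiable reward used as the RL training signal."""
    return 1.0 if verify(instance, answer) else 0.0
```

Because correctness is checked by code rather than compared against an annotated solution, environments like this can be generated procedurally at scale, which is the property ReSyn exploits.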