You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models

📅 2025-11-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that small-scale foundation models (0.5B–7B parameters) exhibit weak reasoning capabilities and struggle to benefit from label-free reinforcement learning (RL). To this end, we propose a progressive unsupervised reasoning training framework tailored for weak models. Our method comprises two core components: (i) a curriculum-guided, majority-voting–based reasoning trajectory masking mechanism that enables controllable-difficulty reasoning path modeling; and (ii) a data difficulty grading generation pipeline that facilitates incremental reasoning skill acquisition under fully unsupervised conditions. Extensive experiments demonstrate consistent reasoning performance gains across multiple model scales, significantly outperforming existing unsupervised RL baselines. The results validate the framework’s effectiveness, generalizability, and scalability in resource-constrained settings.
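The majority-voting-based rollout masking described above can be sketched as follows. In label-free RL, the majority answer across a group of rollouts serves as a pseudo-label; when no answer wins a clear majority, the whole group is masked so it contributes nothing to the policy update. The threshold, function name, and return convention below are illustrative assumptions, not the paper's exact implementation.

```python
from collections import Counter

def majority_vote_rewards(answers, min_agreement=0.5):
    """Assign pseudo-rewards to one group of rollouts on the same prompt.

    `answers` holds the final answer extracted from each rollout
    (None if unparsable). If no answer reaches the `min_agreement`
    fraction of the group, every reward is zeroed and the group is
    flagged as masked -- the "no-majority rollout masking" idea.
    Otherwise rollouts agreeing with the majority answer get reward 1.
    """
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        # No parsable answers at all: mask the entire group.
        return [0.0] * len(answers), True
    top_answer, n_top = counts.most_common(1)[0]
    if n_top / len(answers) < min_agreement:
        # No clear majority: mask so the group is skipped in the RL update.
        return [0.0] * len(answers), True
    rewards = [1.0 if a == top_answer else 0.0 for a in answers]
    return rewards, False
```

For example, a group voting `["4", "4", "4", "7"]` yields rewards `[1, 1, 1, 0]`, while a fully split group such as `["1", "2", "3", "4"]` is masked entirely.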

📝 Abstract
Recent advances in large language models have demonstrated the promise of unsupervised reinforcement learning (RL) methods for enhancing reasoning capabilities without external supervision. However, the generalizability of these label-free RL approaches to smaller base models with limited reasoning capabilities remains unexplored. In this work, we systematically investigate the performance of label-free RL methods across different model sizes and reasoning strengths, from 0.5B to 7B parameters. Our empirical analysis reveals critical limitations: label-free RL is highly dependent on the base model's pre-existing reasoning capability, with performance often degrading below baseline levels for weaker models. We find that smaller models fail to generate sufficiently long or diverse chain-of-thought reasoning to enable effective self-reflection, and that training data difficulty plays a crucial role in determining success. To address these challenges, we propose a simple yet effective label-free RL method that uses curriculum learning to progressively introduce harder problems and masks no-majority rollouts during training. Additionally, we introduce a data curation pipeline to generate samples with predefined difficulty. Our approach demonstrates consistent improvements across all model sizes and reasoning capabilities, providing a path toward more robust unsupervised RL that can bootstrap reasoning abilities in resource-constrained models. We make our code available at https://github.com/BorealisAI/CuMa
Problem

Research questions and friction points this paper is trying to address.

Investigating label-free RL limitations in small models with weak reasoning
Addressing performance degradation in unsupervised reasoning enhancement methods
Developing curriculum learning for robust reasoning in resource-constrained models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses curriculum learning for progressive difficulty
Masks no-majority rollouts during training
Implements data curation for difficulty control
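The curriculum and difficulty-control ideas listed above can be sketched together: problems are pre-graded into difficulty buckets by a curation pipeline, and a schedule gradually widens the set of eligible grades as training proceeds. The linear schedule, bucket representation, and function names below are illustrative assumptions; the paper's exact schedule may differ.

```python
import random

def curriculum_batch(buckets, step, total_steps, batch_size=8, rng=None):
    """Sample a training batch under a simple linear difficulty curriculum.

    `buckets` maps a difficulty grade (0 = easiest) to a list of problems,
    as produced by a difficulty-graded data curation pipeline. Early in
    training only the lowest grades are eligible; the cap rises linearly
    with `step` until all grades are available.
    """
    rng = rng or random.Random()
    grades = sorted(buckets)
    # Fraction of training elapsed determines the highest eligible grade.
    frac = min(step / max(total_steps, 1), 1.0)
    max_idx = int(frac * (len(grades) - 1))
    pool = [p for g in grades[: max_idx + 1] for p in buckets[g]]
    return [rng.choice(pool) for _ in range(batch_size)]
```

At step 0 only the easiest bucket is sampled; by the final step the pool spans every grade, so the model sees harder problems only after easier ones have shaped its reasoning.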