🤖 AI Summary
This work proposes a self-curriculum learning framework to reduce the reliance of chain-of-thought reasoning models on labeled data and computational resources during supervised fine-tuning and reinforcement learning. The approach enables the model to dynamically select high-value training examples based on its current capability, focusing on problems it finds challenging. By integrating adaptive data selection, a boosting mechanism, and counterexample learning, the method operates without assumptions about prompt distributions or task difficulty. Theoretical analysis provides the first proof that this strategy exponentially reduces the number of required reasoning demonstrations and decouples reinforcement learning costs from the quality of the reference model. Experiments demonstrate that the method achieves performance close to that of optimal models while substantially decreasing both annotation requirements and computational overhead.
📝 Abstract
Chain-of-thought reasoning, where language models expend additional computation by producing thinking tokens prior to final responses, has driven significant advances in model capabilities. However, training these reasoning models is extremely costly in terms of both data and compute, as it involves collecting long traces of reasoning behavior from humans or synthetic generators and further post-training the model via reinforcement learning. Are these costs fundamental, or can they be reduced through better algorithmic design? We show that autocurriculum, where the model uses its own performance to decide which problems to focus training on, provably improves upon standard training recipes for both supervised fine-tuning (SFT) and reinforcement learning (RL). For SFT, we show that autocurriculum requires exponentially fewer reasoning demonstrations than non-adaptive fine-tuning, by focusing teacher supervision on prompts where the current model struggles. For RL fine-tuning, autocurriculum decouples the computational cost from the quality of the reference model, reducing the dependence on the reference model to a burn-in cost that is nearly independent of the target accuracy. These improvements arise purely from adaptive data selection, drawing on classical techniques from boosting and learning from counterexamples, and requiring no assumptions about the distribution or difficulty of prompts.
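The core mechanism in the SFT result, selecting prompts where the current model struggles and spending teacher supervision only there, can be illustrated with a toy simulation. Everything below is a hedged sketch, not the paper's algorithm: the prompt pool, the `solve_rate` student model, and the selection threshold are all illustrative assumptions.

```python
import random

random.seed(0)

# Toy sketch of adaptive data selection for SFT. All names here
# (PROMPTS, solve_rate, demos, thresholds) are illustrative assumptions.
PROMPTS = list(range(20))
difficulty = {p: random.random() for p in PROMPTS}  # hidden per-prompt difficulty
demos = {p: 0 for p in PROMPTS}                     # teacher demos collected so far

def solve_rate(p):
    # The "student" improves on a prompt as it receives demonstrations for it.
    return min(1.0, (1.0 - difficulty[p]) + 0.3 * demos[p])

def estimated_accuracy(p, n=50):
    # Monte Carlo estimate of the student's current success rate on prompt p,
    # standing in for the model evaluating its own performance.
    return sum(random.random() < solve_rate(p) for _ in range(n)) / n

def autocurriculum_round(k=5, threshold=0.8):
    # Adaptive data selection: rank prompts by estimated accuracy and spend
    # teacher supervision only on the k weakest ones still below threshold.
    ranked = sorted((estimated_accuracy(p), p) for p in PROMPTS)
    for acc, p in ranked[:k]:
        if acc < threshold:
            demos[p] += 1  # request one more demonstration (one "SFT step")

for _ in range(10):
    autocurriculum_round()
```

After the loop, demonstrations concentrate on the hard prompts while prompts the student already solves reliably receive few or none, which is the intuition behind the claimed savings over non-adaptive fine-tuning that allocates demonstrations uniformly.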