🤖 AI Summary
This work addresses the low sample efficiency of reinforcement learning for large language model reasoning, which stems largely from sparse reward signals. To overcome this limitation, the authors propose Goldilocks, a method built on a teacher–student architecture in which the teacher dynamically estimates the difficulty of each problem relative to the current student model's capability. Guided by the Goldilocks principle (selecting samples that are neither too easy nor too hard), the approach adaptively curates a curriculum that evolves with the student's learning progress. Integrated with the GRPO algorithm, Goldilocks significantly outperforms standard GRPO training under the same computational budget on the OpenMathReasoning dataset, overcoming the limitations of conventional static curriculum designs.
📝 Abstract
Reinforcement learning has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. However, relying on sparse rewards makes this process highly sample-inefficient, as models must navigate vast search spaces with minimal feedback. While classic curriculum learning aims to mitigate this by ordering data by complexity, the right ordering for a specific model is often unclear. To address this, we propose Goldilocks, a novel teacher-driven data sampling strategy that predicts each question's difficulty for the student model. The teacher model selects questions of appropriate difficulty for the student, i.e., questions that are neither too easy nor too hard (the Goldilocks principle), while the student is trained with GRPO. By leveraging the student's performance on seen samples, the teacher continuously adapts to the student's evolving abilities. On the OpenMathReasoning dataset, Goldilocks data sampling improves the performance of models trained with standard GRPO under the same compute budget.
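The adaptive teacher loop described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's actual method: the class name `GoldilocksSampler`, the exponential-moving-average difficulty update, and the target solve rate of 0.5 are all assumptions made for the sketch.

```python
class GoldilocksSampler:
    """Toy teacher that tracks an estimated solve rate per question and
    prefers questions whose estimate is closest to a target (here 0.5,
    i.e., neither too easy nor too hard). Illustrative only; the paper's
    exact difficulty estimator and update rule may differ."""

    def __init__(self, question_ids, target=0.5, lr=0.3):
        # Prior: every question starts at an unknown, mid-range difficulty.
        self.est = {q: 0.5 for q in question_ids}
        self.target = target
        self.lr = lr

    def select(self, k):
        # Pick the k questions whose estimated solve rate is nearest the target.
        return sorted(self.est, key=lambda q: abs(self.est[q] - self.target))[:k]

    def update(self, qid, solve_rate):
        # Move the estimate toward the student's observed solve rate
        # (e.g., the fraction of correct GRPO rollouts on this question).
        self.est[qid] += self.lr * (solve_rate - self.est[qid])
```

In a training loop, `select` would feed a batch to GRPO, and the student's rollout accuracy on each question would be passed back through `update`, so the curriculum shifts as the student improves.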