🤖 AI Summary
This work addresses the challenges in existing reinforcement learning–based approaches for large language models, which often struggle to effectively enhance reasoning capabilities due to high annotation costs, model collapse, or reward hacking. The authors propose EasyRL, a novel framework that, for the first time, incorporates human cognitive learning curves into reinforcement learning for large language models. By combining supervised reinforcement learning initialization with a difficulty-progressive self-evolution mechanism, EasyRL leverages only 10% of easily annotated data and employs consistency-based selection, reflective parsing, and iterative pseudo-label generation to enable divide-and-conquer self-training on unlabeled hard examples. The method substantially outperforms current state-of-the-art baselines on mathematical and scientific reasoning benchmarks, achieving data-efficient and robust model self-evolution.
📝 Abstract
Previous LLM-based RL studies typically follow either supervised paradigms with high annotation costs or unsupervised paradigms that use voting- or entropy-based rewards. However, their performance remains far from satisfactory, owing to the substantial annotation cost in the former case and to issues such as model collapse or reward hacking in the latter. To address these issues, we introduce a new perspective inspired by cognitive learning theory and propose a novel approach called EasyRL. The core of EasyRL is to simulate the human cognitive acquisition curve by integrating reliable knowledge transfer from easy labeled data with a progressive divide-and-conquer strategy that tackles increasingly difficult unlabeled data. Specifically, we first initialize a warm-up model using supervised RL with few-shot labeled data. We then apply a divide-and-conquer pseudo-labeling strategy to difficult unlabeled data, combining consistency-based selection for low-uncertainty cases with reflection-based resolution for medium-uncertainty cases. Finally, difficulty-progressive self-training with iterative pseudo-labeling and RL further strengthens the model's reasoning capability. EasyRL provides a unified self-evolving framework that facilitates data-efficient post-training of LLMs. Experimental results on mathematical and scientific benchmarks demonstrate that EasyRL, using only 10% of easy labeled data, consistently outperforms state-of-the-art baselines.
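The divide-and-conquer pseudo-labeling step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how uncertainty is measured, so the sketch assumes it is the agreement rate among several sampled answers, and the function name `triage`, the thresholds `low` and `high`, and the `reflect` callback are all hypothetical.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

# Hypothetical reflection step: takes the sampled answers, returns a resolved
# answer or None if it cannot decide. In practice this would call the model.
Reflector = Callable[[List[str]], Optional[str]]


def triage(
    answers: List[str],
    low: float = 0.4,    # assumed lower agreement threshold (not from the paper)
    high: float = 0.8,   # assumed upper agreement threshold (not from the paper)
    reflect: Optional[Reflector] = None,
) -> Tuple[str, Optional[str]]:
    """Route one unlabeled question by the agreement among its sampled answers.

    Returns (route, pseudo_label), where route is one of
    "consistency", "reflection", or "defer".
    """
    if not answers:
        return ("defer", None)

    # Agreement rate of the most frequent answer serves as an uncertainty proxy.
    top_answer, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)

    # Low uncertainty: accept the majority answer as the pseudo-label.
    if agreement >= high:
        return ("consistency", top_answer)

    # Medium uncertainty: ask the reflection step to resolve the disagreement.
    if agreement >= low and reflect is not None:
        resolved = reflect(answers)
        if resolved is not None:
            return ("reflection", resolved)

    # High uncertainty: leave the question unlabeled for a later round.
    return ("defer", None)
```

Under these assumptions, questions routed to "defer" would be revisited in a later self-training round, once the model has been updated on the pseudo-labels accepted so far, which is one way to read the difficulty-progressive schedule described above.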