Easy Samples Are All You Need: Self-Evolving LLMs via Data-Efficient Reinforcement Learning

📅 2026-04-19

🤖 AI Summary
This work addresses the limitations of existing reinforcement learning approaches for large language models, which often fail to improve reasoning effectively because of high annotation costs, model collapse, or reward hacking. The authors propose EasyRL, a framework inspired by cognitive learning theory that simulates the human cognitive acquisition curve within reinforcement learning for large language models. EasyRL combines a supervised RL warm-up on a small set of easy labeled data (only 10%) with a difficulty-progressive self-evolution mechanism. For unlabeled hard examples, it applies a divide-and-conquer pseudo-labeling strategy: consistency-based selection for low-uncertainty cases, reflection-based resolution for medium-uncertainty cases, and iterative pseudo-label generation during self-training. According to the authors, the method consistently outperforms state-of-the-art baselines on mathematical and scientific reasoning benchmarks, yielding data-efficient model self-evolution.

📝 Abstract
Previous LLM-based RL studies typically follow either supervised paradigms with high annotation costs or unsupervised paradigms using voting- or entropy-based rewards. However, their performance remains far from satisfactory because of the substantial annotation cost and issues such as model collapse or reward hacking. To address these issues, we introduce a new perspective inspired by cognitive learning theory and propose a novel approach called EasyRL. The core of EasyRL is to simulate the human cognitive acquisition curve by integrating reliable knowledge transfer from easy labeled data with a progressive divide-and-conquer strategy that tackles increasingly difficult unlabeled data. Specifically, we initialize a warm-up model using supervised RL with few-shot labeled data. This is followed by a divide-and-conquer pseudo-labeling strategy on difficult unlabeled data, combining consistency-based selection for low-uncertainty cases and reflection-based resolution for medium-uncertainty cases. Finally, difficulty-progressive self-training with iterative pseudo-labeling and RL further strengthens the model's reasoning capability. EasyRL provides a unified self-evolving framework that facilitates data-efficient post-training of LLMs. Experimental results on mathematical and scientific benchmarks demonstrate that EasyRL, using only 10% of easy labeled data, consistently outperforms state-of-the-art baselines.
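
The divide-and-conquer routing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how uncertainty is measured, so the sketch assumes it is the majority-vote agreement among several sampled answers. The function name `route_by_agreement`, the thresholds `high` and `low`, and the choice to defer high-uncertainty cases to a later round are all hypothetical.

```python
from collections import Counter


def route_by_agreement(answers, high=0.8, low=0.4):
    """Route one unlabeled question by how strongly sampled answers agree.

    `answers` is a list of final answers sampled from the current model.
    The thresholds are illustrative assumptions, not values from the paper.
    Returns (bucket, majority_answer, agreement).
    """
    if not answers:
        raise ValueError("need at least one sampled answer")

    # Majority answer and the fraction of samples that voted for it.
    majority, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)

    if agreement >= high:
        # Low uncertainty: accept the majority answer as a pseudo-label.
        bucket = "consistent"
    elif agreement >= low:
        # Medium uncertainty: pass to a reflection step for resolution.
        bucket = "reflect"
    else:
        # High uncertainty (assumed handling): leave for a later round.
        bucket = "defer"
    return bucket, majority, agreement
```

For example, nine samples answering "4" and one answering "5" would land in the "consistent" bucket, while an even four-way split would be deferred. In a difficulty-progressive loop, deferred questions could be re-sampled after each RL round, once the model has improved on the easier pseudo-labeled cases.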
Problem

Research questions and friction points this paper is trying to address.

annotation cost
model collapse
reward hacking
data efficiency
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

EasyRL
data-efficient reinforcement learning
self-evolving LLMs
pseudo-labeling
cognitive-inspired learning