Online Difficulty Filtering for Reasoning Oriented Reinforcement Learning

📅 2025-04-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
In reasoning-oriented reinforcement learning (RORL), sparse rewards combined with mismatched problem difficulty make training inefficient. Method: This paper proposes the first online dynamic difficulty selection framework with theoretical guarantees. Leveraging a KL-divergence lower bound, the authors prove that batch-wise filtering for intermediate-accuracy problems maximizes policy update efficiency. The method integrates accuracy-variance–aware problem selection, dynamic batch resampling, GRPO-extended policy gradient optimization, and math-task–adaptive difficulty modeling. Results: Evaluated on five mathematical reasoning benchmarks, the approach gains +10% on AIME and +4% in average accuracy over plain GRPO, and surpasses the best baseline reward using only 60% of the training time, demonstrating substantial improvements in both sample and computational efficiency.

📝 Abstract
Reasoning-Oriented Reinforcement Learning (RORL) enhances the reasoning ability of Large Language Models (LLMs). However, due to the sparsity of rewards in RORL, effective training depends heavily on selecting problems of appropriate difficulty. Although curriculum learning attempts to address this by adjusting difficulty, it often relies on static schedules, and even recent online filtering methods lack theoretical grounding and a systematic understanding of their effectiveness. In this work, we show theoretically and empirically that curating each batch on the fly with problems on which the training model achieves intermediate accuracy maximizes the effectiveness of RORL training; we call this balanced online difficulty filtering. We first derive that a lower bound on the KL divergence between the initial and the optimal policy can be expressed in terms of the variance of the sampled accuracy. Building on this insight, we show that balanced filtering maximizes that lower bound, leading to better performance. Experimental results across five challenging math reasoning benchmarks show that balanced online filtering yields an additional 10% on AIME and a 4% average improvement over plain GRPO. Further analysis shows gains in sample and training-time efficiency: the method exceeds the maximum reward of plain GRPO using only 60% of the training time and training data.
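The batch curation described in the abstract can be sketched as follows. This is a hypothetical minimal implementation, not the paper's exact procedure: the helper `estimate_accuracy`, the accuracy thresholds `lo`/`hi`, and the rollout count `k` are illustrative assumptions.

```python
import random


def estimate_accuracy(policy, problem, k=8):
    """Estimate the pass rate of `policy` on `problem` from k rollouts.

    `policy(problem)` is assumed to return a binary reward (1 = solved);
    in practice this would run k sampled reasoning rollouts and verify answers.
    """
    return sum(policy(problem) for _ in range(k)) / k


def balanced_filter_batch(policy, problem_pool, batch_size, lo=0.2, hi=0.8, k=8):
    """Fill a training batch with problems of intermediate sampled accuracy.

    Problems the current policy always solves (acc >= hi) or never solves
    (acc <= lo) provide little gradient signal under sparse binary rewards,
    so they are skipped and the batch is refilled from the shuffled pool.
    """
    batch = []
    for problem in random.sample(problem_pool, len(problem_pool)):
        acc = estimate_accuracy(policy, problem, k)
        if lo < acc < hi:
            batch.append((problem, acc))
        if len(batch) == batch_size:
            break
    return batch
```

In a real RORL loop this filter would run once per training step, so the retained set tracks the policy's current frontier of solvable-but-not-mastered problems rather than a static curriculum.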
Problem

Research questions and friction points this paper is trying to address.

Optimizing problem difficulty selection for Reasoning-Oriented Reinforcement Learning (RORL).
Addressing reward sparsity in RORL via dynamic difficulty filtering.
Maximizing training effectiveness with balanced online difficulty filtering.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Balanced online difficulty filtering maximizes RORL training effectiveness
KL divergence lower bound expressed via the variance of sampled accuracy
Improved performance and efficiency on math reasoning benchmarks
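The link between intermediate accuracy and informative updates can be illustrated for binary rewards. This is a sketch of the intuition only, not the paper's exact bound: for a problem the policy solves with probability $p$, the per-rollout reward is Bernoulli, so

```latex
\operatorname{Var}[r] = p(1-p), \qquad \max_{p \in [0,1]} p(1-p) = \tfrac{1}{4} \ \text{at}\ p = \tfrac{1}{2}.
```

Problems near 50% accuracy therefore carry the highest-variance (most informative) reward signal, while problems at $p \approx 0$ or $p \approx 1$ contribute almost no gradient under GRPO-style group advantage normalization.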