🤖 AI Summary
This work addresses the high computational cost of reinforcement learning–based post-training for large language models, which stems from the large share of prompts that yield negligible learning signal, a waste that is compounded when each prompt requires multiple rollouts. To tackle this, the paper introduces HIVE, a framework built on the observation that prompt utility is concentrated along a dynamically evolving "learning edge." HIVE employs a two-stage online prompt filtering mechanism: coarse selection based on historical reward trajectories, followed by real-time entropy-based pruning of samples whose utility has gone stale. Integrated into GRPO-style algorithms, HIVE significantly reduces the number of rollouts across multiple mathematical reasoning benchmarks, substantially improving sample efficiency while maintaining or even enhancing final task performance.
📝 Abstract
Reinforcement learning (RL) has become essential for post-training large language models (LLMs) on reasoning tasks. While scaling rollouts can stabilize training and enhance performance, the computational overhead is a critical issue. In algorithms like GRPO, generating multiple rollouts per prompt incurs prohibitive costs, as a large portion of prompts provide negligible gradients and are thus of low utility. To address this problem, we investigate how to select high-utility prompts before the rollout phase. Our experimental analysis reveals that sample utility is non-uniform and evolving: the strongest learning signals concentrate at the "learning edge", the intersection of intermediate difficulty and high uncertainty, which shifts as training proceeds. Motivated by this, we propose HIVE (History-Informed and online-VErified prompt selection), a dual-stage framework for data-efficient RL. HIVE utilizes historical reward trajectories for coarse selection and employs prompt entropy as a real-time proxy to prune instances with stale utility. Evaluating HIVE across multiple math reasoning benchmarks and models, we show that it yields substantial rollout savings without compromising performance.
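To make the dual-stage selection concrete, here is a minimal sketch of a HIVE-style prompt filter. All class names, thresholds, and window sizes are illustrative assumptions, not taken from the paper; the paper's real-time signal is prompt entropy under the model, which this sketch approximates with the Bernoulli entropy of recent reward outcomes.

```python
import math
from collections import deque

class HivePromptFilter:
    """Illustrative sketch of a dual-stage prompt filter (names/thresholds are hypothetical).

    Stage 1 (history-informed): keep only prompts whose historical mean reward
    lies in an intermediate band -- neither always-solved nor never-solved.
    Stage 2 (online-verified): among survivors, keep prompts that are still
    uncertain, here proxied by the Bernoulli entropy of their recent rewards.
    """

    def __init__(self, window=8, band=(0.2, 0.8), min_entropy=0.5):
        self.window = window          # how many recent rewards to remember
        self.band = band              # "intermediate difficulty" band on mean reward
        self.min_entropy = min_entropy
        self.history = {}             # prompt_id -> deque of recent 0/1 rewards

    def record(self, prompt_id, reward):
        """Log a binary reward (1 = solved) for a prompt after a rollout."""
        self.history.setdefault(prompt_id, deque(maxlen=self.window)).append(reward)

    def _mean(self, prompt_id):
        h = self.history.get(prompt_id)
        return sum(h) / len(h) if h else None

    @staticmethod
    def _entropy(p):
        """Bernoulli entropy in bits; 0 at p = 0 or 1, maximal (1.0) at p = 0.5."""
        if p in (0.0, 1.0):
            return 0.0
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    def select(self, prompt_ids):
        """Return the prompts worth rolling out; unseen prompts pass by default."""
        kept = []
        for pid in prompt_ids:
            p = self._mean(pid)
            if p is None:                 # no history yet: explore it
                kept.append(pid)
                continue
            lo, hi = self.band
            if not (lo <= p <= hi):       # stage 1: intermediate difficulty only
                continue
            if self._entropy(p) < self.min_entropy:   # stage 2: still uncertain
                continue
            kept.append(pid)
        return kept
```

Usage: always-solved and never-solved prompts fall outside the band and are skipped before any rollout is spent on them, while mixed-outcome and unseen prompts pass through:

```python
f = HivePromptFilter()
for _ in range(8):
    f.record("easy", 1)   # always solved -> mean 1.0, filtered out
    f.record("hard", 0)   # never solved  -> mean 0.0, filtered out
for r in [1, 0, 1, 0, 1, 0, 1, 0]:
    f.record("mid", r)    # mean 0.5, high entropy -> kept
f.select(["easy", "hard", "mid", "new"])  # -> ["mid", "new"]
```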