🤖 AI Summary
This work addresses the high computational cost of reinforcement learning–based post-training for large language models, which stems from the large share of prompts that yield negligible learning signal, a waste that is compounded when each prompt requires multiple rollouts. To tackle this, the paper introduces HIVE, a framework built on the observation that prompt utility is concentrated along a dynamically evolving "learning edge." HIVE employs a two-stage online prompt filtering mechanism: coarse selection based on historical reward trajectories, followed by real-time entropy-based pruning of samples whose utility has gone stale. Integrated into GRPO-style algorithms, HIVE significantly reduces the number of rollouts across multiple mathematical reasoning benchmarks, substantially improving sample efficiency while maintaining or even enhancing final task performance.
📝 Abstract
Reinforcement learning (RL) has become essential for post-training large language models (LLMs) on reasoning tasks. While scaling rollouts can stabilize training and enhance performance, the computational overhead is a critical issue. In algorithms like GRPO, generating multiple rollouts per prompt incurs prohibitive costs, as a large portion of prompts provide negligible gradients and are thus of low utility. To address this problem, we investigate how to select high-utility prompts before the rollout phase. Our experimental analysis reveals that sample utility is non-uniform and evolving: the strongest learning signals concentrate at the "learning edge", the intersection of intermediate difficulty and high uncertainty, which shifts as training proceeds. Motivated by this, we propose HIVE (History-Informed and online-VErified prompt selection), a dual-stage framework for data-efficient RL. HIVE utilizes historical reward trajectories for coarse selection and employs prompt entropy as a real-time proxy to prune instances with stale utility. Evaluating HIVE across multiple math reasoning benchmarks and models, we show that it yields substantial rollout savings without compromising performance.
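To make the dual-stage selection concrete, here is a minimal sketch of a HIVE-style prompt filter. All class names, thresholds, and window sizes are illustrative assumptions, not taken from the paper; the paper's real-time signal is prompt entropy under the model, which this sketch approximates with the Bernoulli entropy of recent reward outcomes.

```python
import math
from collections import deque

class HivePromptFilter:
    """Illustrative sketch of a dual-stage prompt filter (names/thresholds are hypothetical).

    Stage 1 (history-informed): keep only prompts whose historical mean reward
    lies in an intermediate band -- neither always-solved nor never-solved.
    Stage 2 (online-verified): among survivors, keep prompts that are still
    uncertain, here proxied by the Bernoulli entropy of their recent rewards.
    """

    def __init__(self, window=8, band=(0.2, 0.8), min_entropy=0.5):
        self.window = window          # how many recent rewards to remember
        self.band = band              # "intermediate difficulty" band on mean reward
        self.min_entropy = min_entropy
        self.history = {}             # prompt_id -> deque of recent 0/1 rewards

    def record(self, prompt_id, reward):
        """Log a binary reward (1 = solved) for a prompt after a rollout."""
        self.history.setdefault(prompt_id, deque(maxlen=self.window)).append(reward)

    def _mean(self, prompt_id):
        h = self.history.get(prompt_id)
        return sum(h) / len(h) if h else None

    @staticmethod
    def _entropy(p):
        """Bernoulli entropy in bits; 0 at p = 0 or 1, maximal (1.0) at p = 0.5."""
        if p in (0.0, 1.0):
            return 0.0
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    def select(self, prompt_ids):
        """Return the prompts worth rolling out; unseen prompts pass by default."""
        kept = []
        for pid in prompt_ids:
            p = self._mean(pid)
            if p is None:                 # no history yet: explore it
                kept.append(pid)
                continue
            lo, hi = self.band
            if not (lo <= p <= hi):       # stage 1: intermediate difficulty only
                continue
            if self._entropy(p) < self.min_entropy:   # stage 2: still uncertain
                continue
            kept.append(pid)
        return kept
```

Usage: always-solved and never-solved prompts fall outside the band and are skipped before any rollout is spent on them, while mixed-outcome and unseen prompts pass through:

```python
f = HivePromptFilter()
for _ in range(8):
    f.record("easy", 1)   # always solved -> mean 1.0, filtered out
    f.record("hard", 0)   # never solved  -> mean 0.0, filtered out
for r in [1, 0, 1, 0, 1, 0, 1, 0]:
    f.record("mid", r)    # mean 0.5, high entropy -> kept
f.select(["easy", "hard", "mid", "new"])  # -> ["mid", "new"]
```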