🤖 AI Summary
This work addresses the high computational cost of reinforcement learning (RL) fine-tuning for large language models, which stems from the need for numerous rollouts, and the limitations of existing prompt selection strategies, which either rely on expensive evaluations or fail to generalize to new prompts. To overcome these challenges, the authors propose GPS (Generalizable Predictive Prompt Selection), a framework that employs a lightweight generative model to perform Bayesian inference over prompt difficulty. GPS selects informative prompt batches for RL post-training by prioritizing prompts of moderate difficulty while maintaining diversity with respect to the selection history. Notably, GPS is presented as the first online prompt selection method with cross-prompt generalization, leveraging the shared optimization history as a prior. Experiments demonstrate that GPS significantly improves training efficiency, final model performance, and test-time compute allocation across multiple reasoning benchmarks.
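The selection principle described above can be illustrated with a minimal sketch. Note the hedges: the paper's GPS fits a lightweight generative model for Bayesian inference over difficulty, whereas this toy substitutes a simple per-prompt Beta-Bernoulli posterior; the function names, scoring formula, and numbers below are hypothetical, chosen only to show "prefer moderate difficulty, penalize recently selected prompts."

```python
# Illustrative sketch only -- NOT the paper's actual model. GPS uses a
# lightweight generative model trained on the shared optimization history;
# here a Beta-Bernoulli posterior per prompt stands in for it, to show the
# batch acquisition idea: intermediate difficulty + history-anchored diversity.

def posterior_success(successes, failures, prior_a=1.0, prior_b=1.0):
    """Posterior mean success rate under a Beta(prior_a, prior_b) prior."""
    return (prior_a + successes) / (prior_a + prior_b + successes + failures)

def acquisition(p_success, recent_picks, prompt_id, diversity_weight=0.5):
    """Score a prompt: highest near p=0.5, penalized if picked recently."""
    informativeness = 1.0 - abs(p_success - 0.5) * 2.0  # peaks at p = 0.5
    redundancy = recent_picks.get(prompt_id, 0)         # times picked lately
    return informativeness - diversity_weight * redundancy

def select_batch(stats, recent_picks, batch_size):
    """Greedily pick the top-scoring prompts for the next rollout batch."""
    ranked = sorted(
        stats,
        key=lambda pid: acquisition(
            posterior_success(*stats[pid]), recent_picks, pid
        ),
        reverse=True,
    )
    return ranked[:batch_size]

# Toy rollout history: prompt_id -> (successes, failures).
stats = {"p1": (9, 1), "p2": (5, 5), "p3": (2, 8), "p4": (4, 6)}
recent = {"p2": 2}  # p2 was already chosen in the last two batches
print(select_batch(stats, recent, batch_size=2))  # → ['p4', 'p3']
```

Here the nearly-solved prompt `p1` and the over-sampled `p2` are skipped in favor of moderately difficult, less recently used prompts; the real method replaces the per-prompt counts with a generative model that transfers difficulty estimates across prompts.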
📝 Abstract
Reinforcement learning enhances the reasoning capabilities of large language models but often incurs high computational costs due to rollout-intensive optimization. Online prompt selection offers a promising solution by prioritizing informative prompts to improve training efficiency. However, current methods either depend on costly exact evaluations or build prompt-specific predictive models that do not generalize across prompts. This study introduces Generalizable Predictive Prompt Selection (GPS), which performs Bayesian inference over prompt difficulty using a lightweight generative model trained on the shared optimization history. Intermediate-difficulty prioritization and history-anchored diversity are incorporated into the batch acquisition principle to select informative prompt batches. The small predictive model also generalizes at test time, enabling efficient compute allocation. Experiments across varied reasoning benchmarks show that GPS substantially improves training efficiency, final performance, and test-time efficiency over strong baselines.