Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of reinforcement learning (RL) fine-tuning for large language models, which stems from the need for numerous rollouts, as well as the limitations of existing prompt-selection strategies, which either rely on expensive evaluations or fail to generalize to new prompts. To overcome these challenges, the authors propose GPS (Generalizable Predictive Prompt Selection), a framework that employs a lightweight generative model to perform Bayesian inference over prompt difficulty. GPS selects informative prompt batches for RL post-training by prioritizing prompts of moderate difficulty while maintaining diversity with respect to the optimization history. Notably, GPS is the first online prompt-selection method with cross-prompt generalization, leveraging the shared optimization history as a prior. Experiments demonstrate that GPS significantly improves training efficiency, final model performance, and test-time compute allocation across multiple reasoning benchmarks.
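To make the "Bayesian inference on prompt difficulty" idea concrete, here is a minimal stand-in using a conjugate Beta-Bernoulli posterior over a single prompt's rollout success rate. This is a simplification for illustration only: the paper's actual model is a learned generative network that shares the optimization history across prompts, and all function names below are hypothetical.

```python
# Toy Beta-Bernoulli posterior over a prompt's success probability.
# A Beta(alpha, beta) prior updated with Bernoulli rollout outcomes
# stays a Beta distribution (conjugacy), so updates are just counts.

def update_difficulty_posterior(alpha, beta, successes, failures):
    """Update Beta(alpha, beta) after observing rollout outcomes."""
    return alpha + successes, beta + failures

def posterior_mean(alpha, beta):
    """Expected success probability under the Beta posterior."""
    return alpha / (alpha + beta)
```

Starting from a uniform Beta(1, 1) prior, observing 3 successful and 1 failed rollout yields a Beta(4, 2) posterior with mean 2/3, i.e. the prompt looks moderately easy.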

📝 Abstract
Reinforcement learning enhances the reasoning capabilities of large language models but often incurs high computational costs due to rollout-intensive optimization. Online prompt selection offers a plausible remedy by prioritizing informative prompts to improve training efficiency. However, current methods either depend on costly, exact evaluations or build prompt-specific predictive models that do not generalize across prompts. This study introduces Generalizable Predictive Prompt Selection (GPS), which performs Bayesian inference over prompt difficulty using a lightweight generative model trained on the shared optimization history. Its batch acquisition principle combines intermediate-difficulty prioritization with history-anchored diversity to select informative prompt batches. The small predictive model also generalizes at test time, enabling efficient allocation of compute. Experiments across varied reasoning benchmarks show that GPS substantially improves training efficiency, final performance, and test-time efficiency over strong baselines.
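The batch acquisition principle described above — favor prompts of intermediate predicted difficulty while penalizing redundancy with the optimization history — can be sketched as a greedy scorer. This is a hedged illustration of the general idea, not the paper's algorithm; every name and weight below is an assumption.

```python
import numpy as np

def select_prompt_batch(pred_success, embeddings, history_embeddings,
                        batch_size, diversity_weight=0.5):
    """Greedy toy acquisition: favor predicted success rates near 0.5
    (intermediate difficulty) and penalize cosine similarity to
    historical or already-selected prompts (diversity)."""
    # Intermediate-difficulty score: peaks at predicted success = 0.5.
    difficulty_score = 1.0 - 2.0 * np.abs(pred_success - 0.5)

    selected = []
    history = list(history_embeddings)  # grows as prompts are picked
    for _ in range(batch_size):
        if history:
            H = np.stack(history)
            # Cosine similarity of each candidate to its nearest
            # historical/selected prompt measures redundancy.
            sims = (embeddings @ H.T) / (
                np.linalg.norm(embeddings, axis=1, keepdims=True)
                * np.linalg.norm(H, axis=1) + 1e-8)
            redundancy = sims.max(axis=1)
        else:
            redundancy = np.zeros(len(embeddings))
        score = difficulty_score - diversity_weight * redundancy
        score[selected] = -np.inf  # never pick the same prompt twice
        idx = int(np.argmax(score))
        selected.append(idx)
        history.append(embeddings[idx])
    return selected
```

With predicted success rates `[0.0, 0.5, 1.0, 0.48]` and orthogonal embeddings, the scorer picks the 0.5-difficulty prompt first and the 0.48 one second, skipping the prompts the model is certain to fail or solve, since neither extreme yields a useful learning signal.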
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
prompt selection
training efficiency
generalization
large reasoning models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generalizable Predictive Prompt Selection
Bayesian inference
prompt difficulty
training efficiency
reinforcement learning
👥 Authors

Yun Qu
Department of Automation, Tsinghua University

Qi Wang
Tsinghua University
Operations Research · Reinforcement Learning

Yixiu Mao
Department of Automation, Tsinghua University

Heming Zou
Tsinghua University
Machine Learning

Yuhang Jiang
Tsinghua University
Reinforcement Learning · Machine Learning

Weijie Liu
Nankai University
System Security · Virtualization · Binary Analysis · Image Fusion

Clive Bai
LLM Department, Tencent

Kai Yang
Tencent Hunyuan
Reinforcement Learning · LLM

Yangkun Chen
LLM Department, Tencent

Saiyong Yang
LLM Department, Tencent

Xiangyang Ji
Department of Automation, Tsinghua University