Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models?

📅 2025-07-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses online prompt difficulty estimation in reinforcement learning (RL) fine-tuning of large language models (LLMs), where assessing difficulty through repeated rollouts incurs high inference overhead. The authors propose Model Predictive Prompt Selection (MoPPS), a framework that, for the first time, models prompt difficulty as a latent variable and estimates it online, without LLM calls, via streaming Bayesian inference and posterior sampling. MoPPS couples this estimator with a multi-armed bandit mechanism for adaptive prompt selection. Evaluated on mathematical reasoning, planning, and visual geometry tasks, MoPPS reduces LLM rollout counts by 42% on average, accelerates RL fine-tuning convergence, and maintains accurate difficulty prediction. Its core contribution is a Bayesian risk-prediction paradigm for prompt difficulty that operates without LLM interaction and supports dynamic, real-time updates.

📝 Abstract
Recent work has demonstrated the effectiveness of reinforcement learning (RL) finetuning in enhancing the reasoning capabilities of large language models (LLMs). The optimization process often requires numerous iterations to achieve satisfactory performance, resulting in high computational costs due to frequent prompt evaluations under intensive LLM interactions and repeated policy updates. Appropriate online prompt selection methods reduce iteration steps by prioritizing informative prompts during training, yet the pipeline's reliance on exhaustive prompt evaluation and subset selection still incurs substantial overhead from frequent LLM inference calls. Distinguished from these direct evaluate-then-select schemes, this work investigates iterative approximate evaluation for arbitrary prompts and introduces Model Predictive Prompt Selection (MoPPS), a Bayesian risk-predictive framework that estimates prompt difficulty online without requiring costly LLM interactions. Technically, MoPPS models each prompt's success rate as a latent variable, performs streaming Bayesian inference, and employs posterior sampling in a constructed multi-armed bandit, enabling sample-efficient and adaptive prompt selection. Extensive experiments across mathematics, planning, and vision-based geometry tasks show that MoPPS reliably predicts prompt difficulty and accelerates training with significantly reduced LLM rollouts.
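The abstract's core mechanism, treating each prompt's success rate as a latent variable updated by streaming Bayesian inference, can be sketched with a conjugate Beta-Bernoulli model. This is a minimal illustration; the paper's exact update rule (for example, any discounting of stale rollout evidence) may differ.

```python
import random

class PromptPosterior:
    """Streaming Beta posterior over one prompt's latent success rate.

    Each observed rollout outcome (success/failure) is a Bernoulli draw,
    so the Beta prior updates in closed form as evidence streams in.
    """

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha = alpha  # pseudo-count of successful rollouts
        self.beta = beta    # pseudo-count of failed rollouts

    def update(self, successes, failures):
        # Conjugate update: each rollout outcome adds one pseudo-count,
        # so no LLM call is needed to refresh the difficulty estimate.
        self.alpha += successes
        self.beta += failures

    def sample(self):
        # Posterior (Thompson) sample of the latent success rate.
        return random.betavariate(self.alpha, self.beta)

    def mean(self):
        # Posterior mean success rate; low mean = hard prompt.
        return self.alpha / (self.alpha + self.beta)
```

A prompt that succeeds in 3 of 4 rollouts would, under a uniform Beta(1, 1) prior, end with posterior mean 4/6: the estimate sharpens with every batch of outcomes, at the cost of a constant-time counter update.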
Problem

Research questions and friction points this paper is trying to address.

Predict prompt difficulty online to accelerate RL finetuning
Reduce computational costs from frequent LLM interactions
Enable efficient prompt selection without exhaustive evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian risk-predictive framework for prompt selection
Models prompt success rate as latent variable
Streaming Bayesian inference for adaptive selection
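The adaptive selection step above can be sketched as Thompson sampling over the prompt pool: draw one posterior sample per prompt and keep the prompts whose sampled success rate looks most informative. The nearest-to-0.5 scoring below reflects the intuition that mid-difficulty prompts carry the strongest RL training signal; it is an illustrative acquisition rule, not necessarily the paper's exact criterion.

```python
import random

def select_prompts(posteriors, k, seed=None):
    """Thompson-sampling prompt selection over a pool of prompts.

    `posteriors` maps prompt id -> (alpha, beta) Beta-posterior parameters
    over that prompt's latent success rate. One success rate is sampled per
    prompt, and the k prompts whose samples lie nearest 0.5 are returned.
    """
    rng = random.Random(seed)
    ranked = sorted(
        posteriors,
        key=lambda pid: abs(rng.betavariate(*posteriors[pid]) - 0.5),
    )
    return ranked[:k]
```

Because selection uses posterior samples rather than point estimates, uncertain prompts still get explored occasionally, while confidently too-easy or too-hard prompts are skipped without spending any LLM rollouts on them.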
Yun Qu
Department of Automation, Tsinghua University
Qi Cheems Wang
Department of Automation, Tsinghua University
Yixiu Mao
Department of Automation, Tsinghua University
Vincent Tao Hu
Ommer-Lab PostDoc | University of Amsterdam | PKU (generative modeling, visual generation)
Xiangyang Ji
Department of Automation, Tsinghua University