Accelerating LLM Reasoning via Early Rejection with Partial Reward Modeling

📅 2025-08-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) incur prohibitive computational overhead on complex mathematical and logical reasoning tasks, especially when process reward models (PRMs) are used to score many reasoning paths generated in parallel. Method: This paper proposes a dynamic early pruning mechanism guided by intermediate reward signals. It couples PRMs with beam search, adaptively rejecting low-quality reasoning paths after each decoding step. Crucially, it establishes, for the first time, the statistical reliability of token-level partial rewards produced by PRMs, thereby supporting principled path pruning during generation. Results: On mainstream mathematical reasoning benchmarks, the method reduces inference FLOPs by 1.4×–9× while preserving final accuracy, yielding significant gains in inference efficiency and scalability without compromising solution quality.
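The pruning loop described above can be sketched as plain beam search with a reward-based cull after every decoding step. This is a minimal illustration, not the paper's implementation: `extend` and `partial_reward` are hypothetical stand-ins for an LLM's next-token proposals and a PRM's score on a partial sequence.

```python
import heapq
from typing import Callable, List


def beam_search_with_early_rejection(
    init_beams: List[str],
    extend: Callable[[str], List[str]],       # stand-in for LLM next-token proposals
    partial_reward: Callable[[str], float],   # stand-in for a PRM scoring a partial path
    keep_k: int,
    num_steps: int,
) -> List[str]:
    """After each decoding step, score every partial candidate and keep only
    the top-k; the rest are rejected before generation completes."""
    beams = list(init_beams)
    for _ in range(num_steps):
        # Expand every surviving beam by one step.
        expanded = [b + tok for b in beams for tok in extend(b)]
        # Early rejection: rank partial sequences by their partial reward.
        beams = heapq.nlargest(keep_k, expanded, key=partial_reward)
    return beams
```

With a toy alphabet `["a", "b"]` and a reward that counts `"a"` tokens, the search converges on all-`a` sequences while discarding most of the exponential candidate tree at each step, which is where the FLOP savings come from.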

📝 Abstract
Large Language Models (LLMs) are increasingly relied upon for solving complex reasoning tasks in domains such as mathematics, logic, and multi-step question answering. A growing line of work seeks to improve reasoning quality by scaling inference-time compute, particularly through Process Reward Models (PRMs), which reward the reasoning at intermediate steps. While effective, these methods introduce substantial computational overhead, especially when generating large numbers of solutions in parallel. In this paper, we investigate whether PRMs can be used mid-generation to provide early signals that enable the rejection of suboptimal candidates before full generation of a step is complete. We introduce the hypothesis that PRMs are also Partial Reward Models, meaning that the scores they assign to partially completed reasoning steps are predictive of final output quality. This allows for principled early rejection based on intermediate token-level signals. We support this hypothesis both theoretically, by proving that the risk of discarding optimal beams decreases exponentially with generation length, and empirically, by demonstrating a strong correlation between partial and final rewards across multiple reward models. On math reasoning benchmarks, our method achieves up to 1.4×–9× reduction in inference FLOPs without degrading final performance. These results suggest that early rejection is a powerful mechanism for improving the compute-efficiency of reasoning in LLMs.
Problem

Research questions and friction points this paper is trying to address.

Reducing computational overhead in LLM reasoning tasks
Using PRMs for early rejection of suboptimal solutions
Improving compute-efficiency without degrading final performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Early rejection using partial reward modeling
Token-level signals predict final output quality
Reduces FLOPs without performance degradation
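The claim that token-level signals predict final output quality amounts to checking the correlation between partial and final rewards. A toy sketch of that check, with invented scores standing in for PRM outputs (the data below is illustrative only, not the paper's):

```python
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / norm


# Hypothetical PRM scores: each path scored after a few tokens vs. at completion.
partial_rewards = [0.9, 0.2, 0.7, 0.4, 0.8, 0.1]
final_rewards = [0.95, 0.3, 0.6, 0.5, 0.9, 0.2]

r = pearson(partial_rewards, final_rewards)
# A high r is what licenses rejecting low-partial-reward paths early.
```

When `r` is close to 1, a path's early score is a reliable proxy for its final score, so pruning on partial rewards rarely discards the eventual best solution.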