Re-FORC: Adaptive Reward Prediction for Efficient Chain-of-Thought Reasoning

📅 2025-11-03
📈 Citations: 0
Influential: 0

🤖 AI Summary
This paper addresses three limitations of Chain-of-Thought (CoT) reasoning: high computational cost, fixed reasoning length, and non-adaptive model selection. Re-FORC trains a lightweight adapter on a reasoning model that, given the current context, predicts the expected future reward as a function of the number of additional thinking tokens. These predictions support three uses: early stopping of unpromising reasoning chains, joint selection of model and thinking length, and adaptive test-time scaling, with a cost-per-token threshold controlling reasoning length. Reported results: early stopping reduces compute by 26% while maintaining accuracy; model and thinking-length selection achieves 4% higher accuracy at equal compute and 55% less compute at equal accuracy compared with the largest model; adaptive test-time scaling increases accuracy by 11% in the high-compute regime and 7% in the low-compute regime.

📝 Abstract
We propose Re-FORC, an adaptive reward prediction method that, given a context, enables prediction of the expected future rewards as a function of the number of future thinking tokens. Re-FORC trains a lightweight adapter on reasoning models, demonstrating improved prediction with longer reasoning and larger models. Re-FORC enables: 1) early stopping of unpromising reasoning chains, reducing compute by 26% while maintaining accuracy, 2) optimized model and thinking length selection that achieves 4% higher accuracy at equal compute and 55% less compute at equal accuracy compared to the largest model, 3) adaptive test-time scaling, which increases accuracy by 11% in high compute regime, and 7% in low compute regime. Re-FORC allows dynamic reasoning with length control via cost-per-token thresholds while estimating computation time upfront.
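
The abstract's early-stopping and cost-per-token ideas can be illustrated with a small sketch. It assumes a forecast of expected reward at a few candidate token budgets is already available (in Re-FORC this would come from the trained adapter; here it is a plain list). The function names, the decision rule (continue only if some budget's predicted gain exceeds its token cost), and all numbers are illustrative assumptions, not the paper's implementation.

```python
from typing import Sequence, Tuple


def best_extra_tokens(
    budgets: Sequence[int],
    predicted_reward: Sequence[float],
    current_reward: float,
    cost_per_token: float,
) -> Tuple[int, float]:
    """Pick the extra-token budget with the largest predicted net gain.

    Net gain = (predicted reward - current reward) - cost_per_token * tokens.
    Returns (0, 0.0) when no budget has a positive net gain.
    This rule is an assumed stand-in, not the rule used in the paper.
    """
    best_t, best_gain = 0, 0.0
    for t, r in zip(budgets, predicted_reward):
        gain = (r - current_reward) - cost_per_token * t
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain


def should_stop(
    budgets: Sequence[int],
    predicted_reward: Sequence[float],
    current_reward: float,
    cost_per_token: float,
) -> bool:
    """Stop thinking when no additional budget is predicted to pay for itself."""
    extra, _ = best_extra_tokens(
        budgets, predicted_reward, current_reward, cost_per_token
    )
    return extra == 0


# Hypothetical forecast: expected reward after 256, 512, or 1024 more tokens.
budgets = [256, 512, 1024]
forecast = [0.50, 0.62, 0.65]

cheap = best_extra_tokens(budgets, forecast, current_reward=0.45, cost_per_token=1e-4)
pricey_stop = should_stop(budgets, forecast, current_reward=0.45, cost_per_token=1e-3)
```

Raising `cost_per_token` makes the rule stop earlier, which is one way a single threshold can trade accuracy against compute.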

Problem

Research questions and friction points this paper is trying to address.

Predicts expected future reward as a function of the number of future thinking tokens
Enables early stopping of unpromising reasoning chains
Optimizes model selection and thinking length dynamically

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive reward prediction for reasoning chains
Lightweight adapter training on reasoning models
Dynamic reasoning with length control via cost-per-token thresholds
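
To make the joint choice of model and thinking length concrete, here is a minimal sketch. It assumes each candidate model comes with a forecast of expected reward at several thinking lengths and a relative per-token compute cost. The `ModelOption` class, the scoring rule (predicted reward minus a trade-off weight times compute), and the numbers are hypothetical and chosen for illustration only; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class ModelOption:
    name: str
    cost_per_token: float  # assumed relative compute per generated token


def select_model_and_length(
    options: Sequence[ModelOption],
    budgets: Sequence[int],
    forecasts: Dict[str, Sequence[float]],
    tradeoff: float,
) -> Optional[Tuple[str, int, float]]:
    """Return (model name, thinking tokens, score) with the highest score.

    score = predicted reward - tradeoff * cost_per_token * tokens.
    Returns None when there are no candidates.
    """
    best: Optional[Tuple[str, int, float]] = None
    for opt in options:
        for t, r in zip(budgets, forecasts[opt.name]):
            score = r - tradeoff * opt.cost_per_token * t
            if best is None or score > best[2]:
                best = (opt.name, t, score)
    return best


# Hypothetical candidates and forecasts.
options = [ModelOption("small", 1.0), ModelOption("large", 4.0)]
budgets = [512, 2048]
forecasts = {"small": [0.55, 0.70], "large": [0.70, 0.80]}

compute_tight = select_model_and_length(options, budgets, forecasts, tradeoff=1e-4)
compute_loose = select_model_and_length(options, budgets, forecasts, tradeoff=1e-6)
```

With a large trade-off weight the rule favors the small model with a short budget; with a small weight it favors the large model with a long budget, mirroring the low-compute and high-compute regimes described in the abstract.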