🤖 AI Summary
This paper addresses key limitations of Chain-of-Thought (CoT) reasoning, namely high computational cost, fixed inference length, and non-adaptive model selection, by proposing the Re-FORC framework. Re-FORC trains a lightweight, context-aware adapter that predicts expected future rewards during inference and uses these predictions to govern early termination of unpromising chains, selection of a better-suited model, and step-wise extension of reasoning depth; it also treats per-token computational cost as an explicit control signal. Its core contribution is to jointly optimize reasoning length and model scale through a cost-sensitive, adaptive test-time scaling mechanism. Experiments show that Re-FORC reduces computation by 26% while preserving accuracy; achieves 4% higher accuracy at equal compute and 55% less compute at equal accuracy relative to the largest model; and raises accuracy by 11% and 7% in the high- and low-compute regimes, respectively.
📝 Abstract
We propose Re-FORC, an adaptive reward prediction method that, given a context, enables prediction of the expected future rewards as a function of the number of future thinking tokens. Re-FORC trains a lightweight adapter on reasoning models, demonstrating improved prediction with longer reasoning and larger models. Re-FORC enables: 1) early stopping of unpromising reasoning chains, reducing compute by 26% while maintaining accuracy; 2) optimized model and thinking-length selection that achieves 4% higher accuracy at equal compute and 55% less compute at equal accuracy compared to the largest model; 3) adaptive test-time scaling, which increases accuracy by 11% in the high-compute regime and 7% in the low-compute regime. Re-FORC allows dynamic reasoning with length control via cost-per-token thresholds while estimating computation time upfront.
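The abstract's control loop can be pictured as a simple decision rule: keep thinking only while the predicted reward gain from additional thinking tokens exceeds their cost under a cost-per-token threshold, which also yields an upfront estimate of the thinking budget. The sketch below is illustrative, not the paper's implementation; `predicted_reward` is a hypothetical stand-in for the trained adapter, and all function names and constants are assumptions.

```python
# Hypothetical sketch of Re-FORC-style early stopping and budget estimation.
# `predicted_reward` is a toy stand-in for the paper's adapter: expected reward
# rises with more thinking tokens but with diminishing returns.

def predicted_reward(tokens_so_far: int, extra_tokens: int) -> float:
    """Toy reward predictor: a saturating curve in the total token count."""
    total = tokens_so_far + extra_tokens
    return 1.0 - 0.9 * (0.999 ** total)

def should_continue(tokens_so_far: int, step: int, cost_per_token: float) -> bool:
    """Continue reasoning only if the predicted reward gain over the next
    `step` tokens exceeds their cost at the given cost-per-token threshold."""
    gain = predicted_reward(tokens_so_far, step) - predicted_reward(tokens_so_far, 0)
    return gain > cost_per_token * step

def thinking_budget(step: int, cost_per_token: float, max_tokens: int = 100_000) -> int:
    """Estimate upfront how many thinking tokens would be spent before the
    stopping rule triggers, by simulating the decision loop."""
    t = 0
    while t < max_tokens and should_continue(t, step, cost_per_token):
        t += step
    return t
```

Under this rule, raising the cost-per-token threshold shrinks the thinking budget, and a sufficiently high threshold stops reasoning immediately, matching the abstract's notion of length control with upfront computation-time estimates.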