Re-FORC: Adaptive Reward Prediction for Efficient Chain-of-Thought Reasoning

📅 2025-11-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses key limitations of Chain-of-Thought (CoT) reasoning—high computational cost, fixed inference length, and non-adaptive model selection—by proposing the Re-FORC framework. Re-FORC introduces a lightweight, context-aware adapter that predicts expected future rewards in real time during inference, dynamically governing early termination, model switching, and step-wise extension of reasoning depth; it also treats per-token computational cost as an explicit control signal. Its core innovation is jointly optimizing inference length and model scale through a runtime, cost-sensitive adaptive scaling mechanism. Experiments show that Re-FORC reduces computation by 26% while preserving accuracy; improves accuracy by 4% under fixed computational budgets; cuts computational cost by 55% at equivalent accuracy; and boosts accuracy by 11% and 7% in high- and low-compute regimes, respectively.

📝 Abstract
We propose Re-FORC, an adaptive reward prediction method that, given a context, enables prediction of the expected future rewards as a function of the number of future thinking tokens. Re-FORC trains a lightweight adapter on reasoning models, demonstrating improved prediction with longer reasoning and larger models. Re-FORC enables: 1) early stopping of unpromising reasoning chains, reducing compute by 26% while maintaining accuracy, 2) optimized model and thinking length selection that achieves 4% higher accuracy at equal compute and 55% less compute at equal accuracy compared to the largest model, 3) adaptive test-time scaling, which increases accuracy by 11% in high compute regime, and 7% in low compute regime. Re-FORC allows dynamic reasoning with length control via cost-per-token thresholds while estimating computation time upfront.
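As an illustrative sketch (not the authors' implementation), the decision rule implied by the abstract can be written down directly: given a predicted reward curve over future thinking tokens and a cost-per-token threshold, pick the thinking length that maximizes predicted net utility, and stop early when no amount of further thinking improves on stopping now. The function names and the list-based reward curve below are assumptions for illustration.

```python
def optimal_thinking_length(predicted_reward, cost_per_token):
    """Pick the thinking length t that maximizes predicted net utility
    reward(t) - cost_per_token * t, where predicted_reward[t] is the
    expected reward after t additional thinking tokens."""
    best_t, best_utility = 0, predicted_reward[0]
    for t, r in enumerate(predicted_reward):
        utility = r - cost_per_token * t
        if utility > best_utility:
            best_t, best_utility = t, utility
    return best_t, best_utility

def should_stop_early(predicted_reward, cost_per_token, current_reward=0.0):
    """Terminate the chain if no future thinking length is predicted to
    beat the reward of stopping right now."""
    _, best_utility = optimal_thinking_length(predicted_reward, cost_per_token)
    return best_utility <= current_reward
```

Because the chosen length is known before generation, this style of rule also yields the upfront computation-time estimate the abstract mentions: the expected cost is simply `cost_per_token * best_t`.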
Problem

Research questions and friction points this paper is trying to address.

Predicts future rewards for reasoning chain length
Enables early stopping of unpromising reasoning chains
Optimizes model selection and thinking length dynamically
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive reward prediction for reasoning chains
Lightweight adapter training on reasoning models
Dynamic reasoning with length control thresholds
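The model- and length-selection idea above can be sketched as a budget-constrained search over (model, thinking length) pairs, scoring each with its predicted reward. This is a hedged illustration under assumed inputs, not the paper's algorithm; the candidate tuple format is invented for the example.

```python
def select_model_and_length(candidates, budget):
    """candidates: list of (model_name, cost_per_token, predicted_reward_curve),
    where predicted_reward_curve[t] is the expected reward of model_name after
    t thinking tokens. Return the feasible (model, length, reward) triple with
    the highest predicted reward whose total cost fits the compute budget."""
    best = None
    for name, cost_per_token, curve in candidates:
        for t, reward in enumerate(curve):
            if cost_per_token * t <= budget and (best is None or reward > best[2]):
                best = (name, t, reward)
    return best
```

Under this scheme a cheaper model with a longer chain can beat a larger model that exhausts the budget after a few tokens, which is the trade-off the reported equal-compute accuracy gains exploit.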
Renos Zabounidis — PhD Student, Carnegie Mellon University (Interpretable Machine Learning, Multi-Agent Reinforcement Learning)
Aditya Golatkar — AWS Agentic AI
Michael Kleinman — AWS Agentic AI
A. Achille — AWS Agentic AI
Wei Xia — AWS Agentic AI
S. Soatto — AWS Agentic AI