Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high inference cost of large language models on complex reasoning tasks. Existing routing strategies either rely on handcrafted rules, which limit performance, or require training large process reward models, which may be infeasible in many applications. The authors formulate stepwise model routing as a constrained decision-making problem and solve it by training a small control policy with reinforcement learning, using threshold calibration to tune the performance-efficiency tradeoff. At each step, the intermediate chain-of-thought (CoT) state is routed to one of several language models of different sizes. On GSM8K, MATH500, and OmniMath, with both open and closed models, the method consistently improves the accuracy-cost tradeoff over handcrafted routing approaches and achieves a tradeoff comparable to methods that require training large process reward models.

📝 Abstract

Inference-time computation has greatly enhanced the performance of large language models (LLMs) on challenging reasoning tasks, but this strategy can incur high inference costs. One solution is to route intermediate chain-of-thought (CoT) states to language models of different sizes; however, existing approaches rely on handcrafted routing strategies that limit performance, or on training large process reward models that may be infeasible in many applications. We formulate stepwise model routing as a constrained decision-making problem, which we solve by training a small control policy using reinforcement learning in conjunction with threshold calibration to tune the performance-efficiency tradeoff. We validate our method on three math benchmarks (GSM8K, MATH500, and OmniMath) on both open and closed models. Our method consistently improves the accuracy-cost tradeoff compared to handcrafted approaches, while achieving a comparable tradeoff to methods that require training large process reward models.
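To make the idea of stepwise routing with a calibrated threshold concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the policy architecture, the reward, the cost model, or the calibration procedure, so every name below (`route_stepwise`, `calibrate_threshold`, the toy models, the per-step costs) is a hypothetical stand-in. It assumes a policy that scores the current CoT state, a single scalar threshold that decides between a small and a large model, and a calibration step that picks the lowest threshold fitting a cost budget (on the further assumption that using the large model more often does not hurt accuracy).

```python
from typing import Callable, List, Sequence, Tuple

State = List[str]  # the chain-of-thought so far, one string per step


def route_stepwise(
    question: str,
    small_step: Callable[[str, State], str],
    large_step: Callable[[str, State], str],
    policy_score: Callable[[str, State], float],
    threshold: float,
    small_cost: float = 1.0,   # assumed per-step cost of the small model
    large_cost: float = 10.0,  # assumed per-step cost of the large model
    max_steps: int = 8,
) -> Tuple[State, List[str], float]:
    """Generate a CoT step by step, choosing a model at every step."""
    state: State = []
    choices: List[str] = []
    cost = 0.0
    for _ in range(max_steps):
        # The policy scores how much the next step needs the large model.
        use_large = policy_score(question, state) > threshold
        step = (large_step if use_large else small_step)(question, state)
        cost += large_cost if use_large else small_cost
        choices.append("large" if use_large else "small")
        state.append(step)
        if step.startswith("ANSWER"):  # toy stopping convention
            break
    return state, choices, cost


def calibrate_threshold(
    candidates: Sequence[float],
    mean_cost_at: Callable[[float], float],
    budget: float,
) -> float:
    """Pick the lowest candidate threshold whose mean cost fits the budget."""
    feasible = [t for t in candidates if mean_cost_at(t) <= budget]
    if not feasible:
        raise ValueError("no candidate threshold satisfies the cost budget")
    return min(feasible)


# Toy stand-ins so the sketch runs without any real language model.
def toy_small(question: str, state: State) -> str:
    return "ANSWER 4" if len(state) >= 2 else f"small step {len(state)}"


def toy_large(question: str, state: State) -> str:
    return "ANSWER 4" if len(state) >= 2 else f"large step {len(state)}"


def toy_policy(question: str, state: State) -> float:
    # Pretend only the first step is hard enough to need the large model.
    return 0.9 if len(state) == 0 else 0.2


def toy_cost(threshold: float) -> float:
    # "Calibration set" of a single toy question.
    return route_stepwise("2 + 2 = ?", toy_small, toy_large, toy_policy, threshold)[2]
```

In this toy setup, `calibrate_threshold([0.1, 0.5, 0.95], toy_cost, budget=15.0)` selects the threshold that sends only the first step to the large model. In a real system the policy would be a trained network and the calibration set would contain many held-out problems; both are outside what the abstract describes.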

Problem

Research questions and friction points this paper is trying to address.

cost-effective reasoning

model routing

chain-of-thought

inference cost

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

stepwise model routing

reinforcement learning

cost-effective reasoning