🤖 AI Summary
Existing evaluations of large language model (LLM) routing are confined to single-turn prompts, so they do not capture how routing decisions at intermediate steps affect overall success and cost in long-horizon agent tasks. This work introduces TwinRouterBench, a two-track benchmark for step-level routing evaluation. The static track provides 970 router-visible prefixes from 520 instances for offline iteration; the dynamic track provides a harness for end-to-end online validation on the 500-case SWE-bench Verified suite. Static-track scoring is deterministic and requires no online LLM judge, and each prefix carries an execution-verified target tier estimated under a downgrade-and-cascade protocol. The dynamic track measures official task resolution and realized API spend, and the paper reports a 100-case held-out evaluation disjoint from the static SWE supervision split.
📝 Abstract
LLM routing matters most in long-horizon applications such as coding agents, deep research systems, and computer-use agents, where a single user request triggers many model calls. Routing each call to the cheapest sufficient model can cut costs without sacrificing quality, yet existing router benchmarks evaluate routers only on one-shot prompts. They never expose the router-visible prefix at an intermediate agent step, never test whether a cheaper replacement preserves downstream task success, and often rely on online LLM judges at evaluation time. We introduce TwinRouterBench, a step-level routing benchmark with two tracks. The static track provides 970 router-visible prefixes from 520 instances across SWE-bench, BFCL, mtRAG, QMSum, and PinchBench, each paired with an execution-verified target tier estimated under a released downgrade-and-cascade protocol; scoring is deterministic arithmetic over tier labels, trajectory membership, and token costs, with no online evaluator-side LLM judge. The dynamic track supplies a harness that runs routers on the full 500-case SWE-bench Verified suite; in this paper we report a 100-case held-out evaluation disjoint from the static SWE supervision split. At each LLM call the router selects a concrete model from a locked pool, and success is measured by official task resolution and realized API spend. The two tracks support fast offline iteration followed by end-to-end validation under live agent execution. Code and data are available at https://github.com/CommonstackAI/TwinRouterBench.
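To make "deterministic arithmetic over tier labels and token costs" concrete, here is a minimal sketch of what an offline scorer of this kind could look like. It is not the benchmark's released scoring code: the three-tier price table, the `PrefixSample` fields, the `score_router` function, and the three reported metrics are all assumptions chosen for illustration, and the trajectory-membership term mentioned in the abstract is left out because the abstract does not define it.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; tier 0 is the cheapest.
# These numbers are illustrative and do not come from the paper.
PRICE_PER_MTOK = {0: 0.25, 1: 3.0, 2: 15.0}


@dataclass(frozen=True)
class PrefixSample:
    """One router-visible prefix with its annotated label (assumed schema)."""
    target_tier: int  # cheapest tier assumed to preserve downstream success
    tokens: int       # tokens billed for the call made at this step


def score_router(samples, predicted_tiers):
    """Score tier predictions with plain arithmetic, no LLM judge involved."""
    if not samples or len(samples) != len(predicted_tiers):
        raise ValueError("need one predicted tier per sample, and at least one sample")

    top_tier = max(PRICE_PER_MTOK)
    sufficient = exact = 0
    spend = baseline = 0.0
    for sample, tier in zip(samples, predicted_tiers):
        # A prediction is "sufficient" if it is at least as strong as the target.
        sufficient += tier >= sample.target_tier
        exact += tier == sample.target_tier
        spend += sample.tokens * PRICE_PER_MTOK[tier] / 1_000_000
        # Baseline: always call the most expensive tier.
        baseline += sample.tokens * PRICE_PER_MTOK[top_tier] / 1_000_000

    n = len(samples)
    return {
        "sufficient_rate": sufficient / n,
        "exact_rate": exact / n,
        "cost_ratio": spend / baseline,
    }


if __name__ == "__main__":
    demo = [PrefixSample(target_tier=0, tokens=1_000_000),
            PrefixSample(target_tier=2, tokens=1_000_000)]
    print(score_router(demo, [0, 1]))
```

Because every quantity is computed from stored labels and token counts, re-running the scorer on the same predictions always gives the same result, which is the property that makes fast offline iteration possible before a live run on the dynamic track.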