Is Escalation Worth It? A Decision-Theoretic Characterization of LLM Cascades

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study examines the cost-quality trade-off in cascaded deployment of large language models (LLMs), where the deferral threshold is usually tuned as an empirical hyperparameter. Formulating the problem as constrained optimization and applying Lagrangian duality, the work characterizes the cost-quality frontier of LLM cascades. For a two-model cascade, the frontier is piecewise concave on decreasing-benefit regions of the confidence support. For a pool of models, the frontier achievable by deterministic two-model threshold cascades is the pointwise envelope over all pairwise cascades. Experiments on five benchmarks with eight models from five providers show that full fixed chains underperform the pairwise envelope and that optimized subsequence cascades yield no practically meaningful held-out gains over it. A lightweight pre-generation router exceeds the best cascade policy on four of five datasets, mainly because it avoids the cheap model's generation cost on queries sent directly to a larger model, not because its routing signal is stronger. The results suggest that cascades are limited primarily by structural cost, since the cheap model is paid for before any escalation decision, rather than by a shortage of intermediate stages.
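
To make the duality argument concrete, the two-model case can be sketched as follows. The notation is an assumption introduced here for illustration and is not taken from the paper: $s$ is the cheap model's confidence with density $f$, $\tau$ is the deferral threshold (defer when $s < \tau$), $c_1$ and $c_2$ are the per-query costs of the cheap and expensive models, $q_1$ and $q_2$ are their per-query qualities, and $\Delta(s)$ is the expected benefit of deferring at confidence $s$. The sketch also assumes an interior optimum.

```latex
% Illustrative notation only; the paper's own symbols may differ.
\begin{align*}
C(\tau) &= c_1 + c_2 \Pr(s < \tau), \\
Q(\tau) &= \mathbb{E}[q_1] + \mathbb{E}\big[\Delta(s)\,\mathbf{1}\{s < \tau\}\big],
\qquad \Delta(s) = \mathbb{E}[q_2 - q_1 \mid s].
\end{align*}

% Budget-constrained form: maximize quality subject to C(\tau) \le B.
\begin{align*}
\mathcal{L}_B(\tau, \lambda) &= Q(\tau) - \lambda\,\big(C(\tau) - B\big), \\
\frac{\partial \mathcal{L}_B}{\partial \tau} = 0
&\;\Longrightarrow\; \Delta(\tau^\ast)\, f(\tau^\ast) = \lambda\, c_2\, f(\tau^\ast)
\;\Longrightarrow\; \Delta(\tau^\ast) = \lambda\, c_2 .
\end{align*}

% Quality-constrained form: minimize cost subject to Q(\tau) \ge \bar{q}.
\begin{align*}
\mathcal{L}_Q(\tau, \mu) &= C(\tau) + \mu\,\big(\bar{q} - Q(\tau)\big), \\
\frac{\partial \mathcal{L}_Q}{\partial \tau} = 0
&\;\Longrightarrow\; c_2 = \mu\, \Delta(\tau^\ast)
\;\Longrightarrow\; \mu = \frac{c_2}{\Delta(\tau^\ast)} = \frac{1}{\lambda} .
\end{align*}
```

Under these assumptions the two shadow prices are reciprocal at the same threshold, and the frontier is concave wherever $\Delta(s)$ is decreasing in $s$, since each additional unit of cost spent on deferral then buys less quality than the one before.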

📝 Abstract

Model cascades, in which a cheap LLM defers to an expensive one on low-confidence queries, are widely used to navigate the cost-quality tradeoff at deployment. Existing approaches largely treat the deferral threshold as an empirical hyperparameter, with limited guidance on the geometry of the resulting cost-quality frontier over a model pool. We develop a decision-theoretic framework grounded in constrained optimization and duality. For a two-model cascade, we establish piecewise concavity of the cost-quality frontier on decreasing-benefit regions of the confidence support, with reciprocal shadow prices linking the budget- and quality-constrained formulations. Given a pool of $k$ models, we characterize the frontier achievable by deterministic two-model threshold cascades as the pointwise envelope over $\binom{k}{2}$ pairwise cascades, with switching points where the optimal pair changes. For $k$-model cascades, we derive first-order conditions in which a single shadow price equalizes marginal quality-per-cost across stage boundaries. We validate the framework on five benchmarks (MATH, MMLU, TriviaQA, SimpleQA, LiveCodeBench) across eight models from five providers. Within the deterministic threshold-cascade class, full fixed chains underperform the pairwise envelope, and optimized subsequence cascades do not deliver practically meaningful held-out gains over it. A lightweight pre-generation router exceeds the best cascade policy on four of five datasets, mainly because it avoids the cheap model's generation cost on queries sent directly to a larger model rather than because of a stronger routing signal. These results suggest that cascade performance is limited primarily by structural cost, since cascades pay the cheap model before any escalation decision, rather than by a shortage of intermediate stages.
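
The structural-cost argument and the pairwise envelope can be illustrated with a small sketch. Everything below is hypothetical: the function names, the three-model pool, the costs, and the per-query correctness and confidence values are made up for illustration and are not the paper's code or data. The router is also idealized, since it is given the same deferral decisions as the cascade, which isolates the cost difference from any difference in routing signal.

```python
from itertools import combinations


def cascade_point(conf, q_cheap, q_big, c_cheap, c_big, tau):
    """Average (cost, quality) of a two-model threshold cascade.

    Every query pays c_cheap; queries with confidence below tau
    are escalated and additionally pay c_big.
    """
    n = len(conf)
    cost = 0.0
    quality = 0.0
    for s, qc, qb in zip(conf, q_cheap, q_big):
        if s < tau:
            cost += c_cheap + c_big
            quality += qb
        else:
            cost += c_cheap
            quality += qc
    return cost / n, quality / n


def router_point(send_to_big, q_cheap, q_big, c_cheap, c_big):
    """Average (cost, quality) of a pre-generation router.

    Each query pays for exactly one model, chosen before generation.
    """
    n = len(send_to_big)
    cost = sum(c_big if d else c_cheap for d in send_to_big)
    quality = sum(qb if d else qc
                  for d, qc, qb in zip(send_to_big, q_cheap, q_big))
    return cost / n, quality / n


def pairwise_envelope(pool, budget):
    """Best (quality, (cheap, big), tau) over all pairwise cascades
    whose average cost fits the budget, or None if none is feasible.

    pool maps a model name to (cost, correctness list, confidence list).
    """
    best = None
    for a, b in combinations(pool, 2):
        cheap, big = sorted((a, b), key=lambda name: pool[name][0])
        c_cheap, q_cheap, conf = pool[cheap]
        c_big, q_big, _ = pool[big]
        # Candidate thresholds: no deferral, each observed score, full deferral.
        thresholds = [float("-inf")] + sorted(set(conf)) + [float("inf")]
        for tau in thresholds:
            cost, quality = cascade_point(conf, q_cheap, q_big,
                                          c_cheap, c_big, tau)
            if cost <= budget and (best is None or quality > best[0]):
                best = (quality, (cheap, big), tau)
    return best


# Hypothetical pool: name -> (cost per query, correctness, confidence).
pool = {
    "small": (1.0, [1, 0, 1, 0], [0.9, 0.2, 0.8, 0.1]),
    "mid": (4.0, [1, 1, 1, 0], [0.9, 0.7, 0.8, 0.3]),
    "large": (10.0, [1, 1, 1, 1], [0.9, 0.9, 0.9, 0.9]),
}

c_small, q_small, conf_small = pool["small"]
c_large, q_large, _ = pool["large"]

cascade = cascade_point(conf_small, q_small, q_large, c_small, c_large, tau=0.5)
decisions = [s < 0.5 for s in conf_small]  # same split, made before generation
router = router_point(decisions, q_small, q_large, c_small, c_large)
envelope_best = pairwise_envelope(pool, budget=6.0)

print("cascade (cost, quality):", cascade)
print("router  (cost, quality):", router)
print("best pairwise cascade under budget 6.0:", envelope_best)
```

With identical decisions, the cascade and the router reach the same quality, but the cascade's average cost is higher by the cheap model's cost times the deferral rate. That gap is the structural cost the abstract refers to; adding intermediate stages does not remove it.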

Problem

Research questions and friction points this paper is trying to address.

LLM cascades

cost-quality tradeoff

deferral threshold

decision-theoretic framework

model routing

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM cascades

decision-theoretic framework

cost-quality tradeoff