🤖 AI Summary
This study investigates the trade-off between performance and cost in large language model (LLM) services, and how routing strategies shape user behavior. It formulates a Stackelberg game between a service provider and a user: the provider routes tasks between a standard model and a reasoning (inference-optimized) model, and the user decides whether to retry or abandon a task based on response utility and latency. The analysis reveals a fundamental misalignment between the provider's optimal routing policy and user preferences. The work demonstrates that static, non-cascaded routing is typically optimal and identifies conditions under which the provider is incentivized to deliberately increase latency, cutting costs at the expense of user experience. Finally, it derives a simple threshold-based routing rule for single-provider, single-user settings and delineates the regimes in which routing, cascading, and latency throttling each help or harm.
📝 Abstract
To mitigate the trade-off between performance and cost, LLM providers route user tasks to different models based on task difficulty and latency. We study how LLM routing interacts with user behavior. We propose a game between an LLM provider with two models (standard and reasoning) and a user who can re-prompt or abandon a task if the routed model cannot solve it. The user's goal is to maximize their utility minus the delay from using the model, while the provider minimizes the cost of servicing the user. We solve this Stackelberg game by fully characterizing the user's best response and simplifying the provider's problem. We find that in nearly all cases the optimal routing policy is static, with no cascading, and depends on the expected utility of the models to the user. Furthermore, we reveal a misalignment gap between the provider-optimal and user-preferred routes when the user's and provider's rankings of the models, by utility and by cost respectively, differ. Finally, we demonstrate conditions for extreme misalignment in which providers are incentivized to throttle model latency to minimize their costs, consequently depressing user utility. The results yield simple threshold rules for single-provider, single-user interactions and clarify when routing, cascading, and throttling help or harm.
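To give a concrete feel for the kind of static, threshold-based routing rule the abstract describes, here is a minimal hypothetical sketch. The model names, fields, and the `delay_weight` parameter are illustrative assumptions, not the paper's actual formulation: the provider compares each model's expected utility to the user net of a latency penalty and routes every task to the winner, with provider cost only breaking ties.

```python
# Hypothetical sketch of a static threshold routing rule.
# All names, fields, and numbers are illustrative assumptions,
# not the paper's actual model.
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    expected_utility: float  # expected utility of this model's answer to the user
    cost: float              # provider's cost per query
    latency: float           # user-perceived delay

def route(standard: Model, reasoning: Model, delay_weight: float) -> Model:
    """Send every task to the model with higher expected user utility
    net of latency; break ties by lower provider cost (no cascading)."""
    def net_utility(m: Model) -> float:
        return m.expected_utility - delay_weight * m.latency
    u_s, u_r = net_utility(standard), net_utility(reasoning)
    if u_s == u_r:
        return standard if standard.cost <= reasoning.cost else reasoning
    return standard if u_s > u_r else reasoning

standard = Model("standard", expected_utility=0.6, cost=1.0, latency=1.0)
reasoning = Model("reasoning", expected_utility=0.9, cost=5.0, latency=3.0)

# When delay matters little, the reasoning model's higher utility wins;
# as the latency penalty grows, the rule flips to the standard model.
print(route(standard, reasoning, delay_weight=0.05).name)  # reasoning
print(route(standard, reasoning, delay_weight=0.2).name)   # standard
```

The flip between the two calls illustrates the misalignment the abstract highlights: a provider who can throttle latency effectively raises `delay_weight`'s impact, steering traffic toward the cheaper model while depressing user utility.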