🤖 AI Summary
This work addresses the inefficiency of prevailing pay-per-computation pricing models in the LLM-as-a-service market, which often lead to high user costs without commensurate gains in output quality. To remedy this, the paper introduces, for the first time, a mechanism design approach to LLM inference pricing, proposing a reverse second-price auction where service providers bid their prices alongside expected output quality, and users pay the marginal value of the winning bid relative to the second-best alternative. This incentive-compatible mechanism discourages wasteful computation while aligning provider incentives with quality delivery. Empirical evaluations on distilled variants of Llama, Qwen, and DeepSeek-R1 across mathematical and scientific benchmarks demonstrate that the proposed method substantially reduces user expenditure while maintaining or even improving output quality, thereby enhancing overall market social efficiency.
📝 Abstract
Test-time compute has emerged as a promising strategy to enhance the reasoning abilities of large language models (LLMs). However, this strategy has in turn increased the amount users pay to cloud-based providers offering LLM-as-a-service, since providers charge for the amount of test-time compute used to generate an output. In our work, we show that the LLM-as-a-service market is socially inefficient: providers have a financial incentive to increase the amount of test-time compute, even if this increase contributes little to the quality of the outputs. To address this inefficiency, we introduce a reverse second-price auction mechanism where providers bid their offered price and (expected) quality for the opportunity to serve a user, and users pay proportionally to the marginal value generated by the winning provider relative to the second-highest bidder. To illustrate and complement our theoretical results, we conduct experiments with multiple instruct models from the $\texttt{Llama}$ and $\texttt{Qwen}$ families, as well as reasoning models distilled from $\texttt{DeepSeek-R1}$, on math and science benchmark datasets.
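To make the auction concrete, here is a minimal sketch of a reverse second-price (second-score) mechanism of the kind the abstract describes. The scoring rule (expected quality minus asking price) and the exact payment formula are illustrative assumptions, not the paper's precise formulation; the function name and the numeric bids are hypothetical.

```python
# Hypothetical sketch of a reverse second-price auction over (price, quality)
# bids. Scoring and payment rules are illustrative assumptions, not the
# paper's exact mechanism.

def run_auction(bids):
    """bids: list of (provider, price, expected_quality) tuples.

    Each provider is scored by the surplus it offers the user (expected
    quality minus asking price). The highest-scoring provider wins, and
    the user's payment is set so that the user's realized surplus equals
    the second-best score: the winner receives its own price plus the
    marginal value it adds over the runner-up.
    """
    scored = sorted(bids, key=lambda b: b[2] - b[1], reverse=True)
    (winner, p1, q1), (_, p2, q2) = scored[0], scored[1]
    best_score, second_score = q1 - p1, q2 - p2
    payment = p1 + (best_score - second_score)
    return winner, payment

# Illustrative bids: provider B offers the best quality-per-price tradeoff
# (score 5.0 vs. 3.0 for A and 2.5 for C), so it wins and is paid its
# price (4.0) plus its marginal value over A (2.0).
bids = [("A", 5.0, 8.0), ("B", 4.0, 9.0), ("C", 6.0, 8.5)]
winner, payment = run_auction(bids)
```

Because the payment depends only on the runner-up's score, a provider cannot raise its revenue by padding test-time compute (and hence its price) beyond what its quality justifies, which is the incentive-compatibility property the summary highlights.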