Test-Time Compute Games

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency of prevailing pay-per-computation pricing in the LLM-as-a-service market, which often imposes high user costs without commensurate gains in output quality. To remedy this, the paper introduces the first mechanism-design approach to LLM inference pricing: a reverse second-price auction in which service providers bid a price alongside an expected output quality, and the user pays according to the marginal value of the winning bid relative to the second-best alternative. This incentive-compatible mechanism discourages wasteful computation and aligns provider incentives with quality delivery. Empirical evaluations on instruct models from the Llama and Qwen families and on reasoning models distilled from DeepSeek-R1, across mathematical and scientific benchmarks, show that the proposed mechanism substantially reduces user expenditure while maintaining or even improving output quality, thereby improving the market's social efficiency.

📝 Abstract
Test-time compute has emerged as a promising strategy to enhance the reasoning abilities of large language models (LLMs). However, this strategy has in turn increased how much users pay cloud-based providers offering LLM-as-a-service, since providers charge users for the amount of test-time compute they use to generate an output. In our work, we show that the market of LLM-as-a-service is socially inefficient: providers have a financial incentive to increase the amount of test-time compute, even if this increase contributes little to the quality of the outputs. To address this inefficiency, we introduce a reverse second-price auction mechanism where providers bid their offered price and (expected) quality for the opportunity to serve a user, and users pay proportionally to the marginal value generated by the winning provider relative to the second-highest bidder. To illustrate and complement our theoretical results, we conduct experiments with multiple instruct models from the $\texttt{Llama}$ and $\texttt{Qwen}$ families, as well as reasoning models distilled from $\texttt{DeepSeek-R1}$, on math and science benchmark datasets.
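The payment rule described in the abstract can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the function name, the bid representation as (provider, price, expected quality) tuples, the surplus-based winner selection, and the tie-breaking behavior are all assumptions made here for clarity.

```python
def run_auction(bids):
    """Hypothetical reverse second-price auction for LLM serving.

    bids: list of (provider, price, expected_quality) tuples.
    Returns the winning provider and the payment it receives.
    """
    # Surplus each bid offers the user: expected quality minus asked price.
    scored = sorted(bids, key=lambda b: b[2] - b[1], reverse=True)
    winner = scored[0]
    runner_up_surplus = scored[1][2] - scored[1][1]
    # Second-price payment: the winner is paid so that the user keeps
    # only the runner-up's surplus, i.e. the winner captures its
    # marginal value relative to the second-best bid.
    payment = winner[2] - runner_up_surplus
    return winner[0], payment
```

For example, with bids ("A", price 2, quality 10) and ("B", price 3, quality 9), provider A offers surplus 8 versus B's 6, so A wins and is paid 10 - 6 = 4: more than its asked price, but capped by the competing bid. Under this rule, bidding one's true cost and quality is a provider's best strategy, which is the incentive-compatibility property the paper analyzes.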
Problem

Research questions and friction points this paper is trying to address.

test-time compute
LLM-as-a-service
market inefficiency
cloud pricing
social efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time compute
LLM-as-a-service
reverse second-price auction
social efficiency
marginal value pricing
Ander Artola Velasco
PhD candidate, Max Planck Institute for Software Systems
Statistics, Machine Learning
Dimitrios Rontogiannis
Max Planck Institute for Software Systems, Kaiserslautern, Germany
Stratis Tsirtsis
Max Planck Institute for Software Systems
Machine learning, decision making, causality, game theory
Manuel Gomez-Rodriguez
Max Planck Institute for Software Systems, Kaiserslautern, Germany