🤖 AI Summary
This work addresses the inefficiency of prevailing pay-per-computation pricing models in the LLM-as-a-service market, which often lead to high user costs without commensurate gains in output quality. To remedy this, the paper introduces, for the first time, a mechanism design approach to LLM inference pricing, proposing a reverse second-price auction where service providers bid their prices alongside expected output quality, and users pay the marginal value of the winning bid relative to the second-best alternative. This incentive-compatible mechanism discourages wasteful computation while aligning provider incentives with quality delivery. Empirical evaluations on distilled variants of Llama, Qwen, and DeepSeek-R1 across mathematical and scientific benchmarks demonstrate that the proposed method substantially reduces user expenditure while maintaining or even improving output quality, thereby enhancing overall market social efficiency.
📝 Abstract
Test-time compute has emerged as a promising strategy to enhance the reasoning abilities of large language models (LLMs). However, this strategy has in turn increased the amount users pay to cloud-based providers offering LLM-as-a-service, since providers charge for the amount of test-time compute used to generate an output. In our work, we show that the LLM-as-a-service market is socially inefficient: providers have a financial incentive to increase the amount of test-time compute, even if this increase contributes little to the quality of the outputs. To address this inefficiency, we introduce a reverse second-price auction mechanism where providers bid their offered price and (expected) quality for the opportunity to serve a user, and users pay proportionally to the marginal value generated by the winning provider relative to the second-highest bidder. To illustrate and complement our theoretical results, we conduct experiments with multiple instruct models from the $\texttt{Llama}$ and $\texttt{Qwen}$ families, as well as reasoning models distilled from $\texttt{DeepSeek-R1}$, on math and science benchmark datasets.
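To make the auction concrete, here is a minimal sketch of a reverse second-price (second-score) mechanism of the kind the abstract describes. The scoring rule (expected quality minus asking price) and the exact payment formula are illustrative assumptions, not the paper's precise formulation; the function name and the numeric bids are hypothetical.

```python
# Hypothetical sketch of a reverse second-price auction over (price, quality)
# bids. Scoring and payment rules are illustrative assumptions, not the
# paper's exact mechanism.

def run_auction(bids):
    """bids: list of (provider, price, expected_quality) tuples.

    Each provider is scored by the surplus it offers the user (expected
    quality minus asking price). The highest-scoring provider wins, and
    the user's payment is set so that the user's realized surplus equals
    the second-best score: the winner receives its own price plus the
    marginal value it adds over the runner-up.
    """
    scored = sorted(bids, key=lambda b: b[2] - b[1], reverse=True)
    (winner, p1, q1), (_, p2, q2) = scored[0], scored[1]
    best_score, second_score = q1 - p1, q2 - p2
    payment = p1 + (best_score - second_score)
    return winner, payment

# Illustrative bids: provider B offers the best quality-per-price tradeoff
# (score 5.0 vs. 3.0 for A and 2.5 for C), so it wins and is paid its
# price (4.0) plus its marginal value over A (2.0).
bids = [("A", 5.0, 8.0), ("B", 4.0, 9.0), ("C", 6.0, 8.5)]
winner, payment = run_auction(bids)
```

Because the payment depends only on the runner-up's score, a provider cannot raise its revenue by padding test-time compute (and hence its price) beyond what its quality justifies, which is the incentive-compatibility property the summary highlights.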