Sleep-time Compute: Beyond Inference Scaling at Test-time

📅 2025-04-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high latency and cost of scaling test-time compute in large language models (LLMs) by proposing a new paradigm, sleep-time compute: the model "thinks" offline about a context before queries arrive, anticipating likely questions and pre-computing useful intermediate quantities, thereby shifting inference load forward. Methodologically, the paper formalizes sleep-time compute, constructs stateful variants of two reasoning benchmarks (Stateful GSM-Symbolic and Stateful AIME) plus Multi-Query GSM-Symbolic, which pairs each context with multiple related queries, and analyzes when pre-computation pays off, finding that the predictability of the user query correlates well with its efficacy. Experiments show that sleep-time compute reduces the test-time compute needed to reach the same accuracy by roughly 5x, that scaling it further raises accuracy by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME, and that amortizing it across related queries cuts average cost per query by 2.5x.
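The two-phase split described above can be sketched with a toy stand-in: a deliberately simple arithmetic "context" rather than LLM reasoning, and function names (`sleep_time_compute`, `answer_query`) that are my own, not from the paper. The point is only the shape of the paradigm: pay for reasoning over the context once, offline, so that each later query is cheap.

```python
# Toy sketch of the sleep-time compute paradigm. The paper applies this to
# LLM reasoning; here an arithmetic context stands in for illustration.

def sleep_time_compute(context):
    """Offline phase: pre-reason over the raw context and cache intermediate
    quantities that anticipated queries are likely to need."""
    enriched = dict(context)
    enriched["total_cost"] = sum(it["price"] * it["qty"] for it in context["items"])
    enriched["item_count"] = sum(it["qty"] for it in context["items"])
    return enriched

def answer_query(enriched, query):
    """Test-time phase: answer with a cheap lookup instead of re-deriving
    everything from the raw context."""
    if query == "total cost":
        return enriched["total_cost"]
    if query == "average price per item":
        return enriched["total_cost"] / enriched["item_count"]
    raise ValueError("unanticipated query; would fall back to full test-time reasoning")

context = {"items": [{"price": 3, "qty": 2}, {"price": 5, "qty": 1}]}
enriched = sleep_time_compute(context)   # paid once, before any query arrives
print(answer_query(enriched, "total cost"))              # → 11
print(answer_query(enriched, "average price per item"))  # → 3.666...
```

Note the fallback path: when a query was not anticipated, test-time compute is still needed in full, which is why the paper finds query predictability to be the key variable.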

📝 Abstract
Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but comes with high latency and inference cost. We introduce sleep-time compute, which allows models to "think" offline about contexts before queries are presented: by anticipating what queries users might ask and pre-computing useful quantities, we can significantly reduce the compute requirements at test-time. To demonstrate the efficacy of our method, we create modified versions of two reasoning tasks - Stateful GSM-Symbolic and Stateful AIME. We find that sleep-time compute can reduce the amount of test-time compute needed to achieve the same accuracy by ~5x on Stateful GSM-Symbolic and Stateful AIME and that by scaling sleep-time compute we can further increase accuracy by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME. Furthermore, we introduce Multi-Query GSM-Symbolic, which extends GSM-Symbolic by including multiple related queries per context. By amortizing sleep-time compute across related queries about the same context using Multi-Query GSM-Symbolic, we can decrease the average cost per query by 2.5x. We then conduct additional analysis to understand when sleep-time compute is most effective, finding the predictability of the user query to be well correlated with the efficacy of sleep-time compute. Finally, we conduct a case-study of applying sleep-time compute to a realistic agentic SWE task.
Problem

Research questions and friction points this paper is trying to address.

Reducing the latency and cost of test-time compute for large language models
Exploiting the predictability of user queries so that reasoning over a context can happen offline, before queries arrive
Maintaining or improving accuracy on reasoning tasks under smaller test-time compute budgets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formalizes sleep-time compute: pre-reasoning over contexts and caching useful quantities before queries arrive
Reduces test-time compute needed for the same accuracy by ~5x on Stateful GSM-Symbolic and Stateful AIME
Cuts average cost per query by 2.5x by amortizing sleep-time compute across related queries on Multi-Query GSM-Symbolic
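The amortization claim follows from a simple cost identity: the one-time sleep-time cost is spread over every query that shares the context. The sketch below uses hypothetical cost numbers (not the paper's measured token counts) purely to show the shape of the trade-off.

```python
# Toy cost model for amortizing sleep-time compute across related queries.
# All numbers are illustrative, not figures from the paper.

def average_cost_per_query(n_queries, sleep_cost, test_cost_with, test_cost_without):
    """Return (amortized cost per query with sleep-time compute,
    baseline cost per query without it)."""
    amortized = (sleep_cost + n_queries * test_cost_with) / n_queries
    return amortized, test_cost_without

for n in (1, 2, 5, 10):
    amortized, baseline = average_cost_per_query(
        n, sleep_cost=100, test_cost_with=20, test_cost_without=100
    )
    print(f"{n} queries: {amortized:.0f} per query vs {baseline} baseline")
```

With a single query the offline work may not pay for itself (120 vs 100 in this toy setting), but as more related queries reuse the same enriched context the average cost falls well below the baseline, which is the mechanism behind the paper's 2.5x reduction on Multi-Query GSM-Symbolic.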