🤖 AI Summary
This work addresses the cost–latency trade-off in large language model (LLM) inference at scale by introducing the first systematic economic analysis framework. Methodologically, it jointly models computational cost, memory and network bandwidth bottlenecks, and end-to-end latency constraints to derive the Pareto frontier between per-token cost and generation throughput across hardware configurations. The framework integrates arithmetic intensity analysis, bandwidth-aware parallelism modeling, and latency-sensitive batch-size optimization. Empirical evaluation across multiple state-of-the-art LLMs reveals universal principles governing optimal parallelization strategies and batch sizes under hardware constraints. The results provide quantifiable, decision-ready guidance for industrial LLM deployment—enabling optimal hardware selection and scheduling configuration under given budget and latency requirements.
📝 Abstract
We develop a theoretical model that addresses the economic trade-off between cost per token and serial token generation speed when deploying LLMs for inference at scale. Our model accounts for arithmetic, memory-bandwidth, network-bandwidth, and latency constraints, and optimizes over parallelism setups and batch sizes to find those that maximize serial inference speed at a given cost per token. We use the model to compute Pareto frontiers of serial speed versus cost per token for popular language models.
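The core trade-off the abstract describes can be sketched with a toy roofline-style model: each decoding step is bounded by either compute time or the time to stream the model weights through memory, so increasing batch size lowers cost per token but, once compute-bound, slows serial generation. All hardware numbers below (FLOPs, bandwidth, hourly price) are illustrative assumptions, not figures from the paper.

```python
def step_time(batch_size, params, peak_flops, mem_bw):
    """Seconds per decoding step: max of compute time and weight-streaming time."""
    weight_bytes = 2 * params                          # fp16 weights
    compute_s = 2 * params * batch_size / peak_flops   # ~2 FLOPs per param per token
    memory_s = weight_bytes / mem_bw                   # weights read once per step
    return max(compute_s, memory_s)

def speed_and_cost(batch_size, params=70e9, peak_flops=1e15,
                   mem_bw=3.35e12, usd_per_hour=4.0):
    """Serial tokens/s per request and dollar cost per generated token."""
    t = step_time(batch_size, params, peak_flops, mem_bw)
    serial_tok_per_s = 1.0 / t            # speed seen by a single request
    total_tok_per_s = batch_size / t      # aggregate throughput of the batch
    usd_per_token = usd_per_hour / 3600.0 / total_tok_per_s
    return serial_tok_per_s, usd_per_token

# Sweeping batch size traces out the speed-vs-cost frontier:
for b in (1, 8, 64, 512):
    speed, cost = speed_and_cost(b)
    print(f"batch={b:4d}  serial={speed:8.1f} tok/s  cost=${cost:.2e}/token")
```

In this sketch, small batches are memory-bandwidth-bound (serial speed is flat while cost per token falls with batch size); past the compute/memory crossover, larger batches keep reducing cost only by sacrificing serial speed, which is exactly the Pareto trade-off the model optimizes over.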