🤖 AI Summary
This work addresses the cost–latency trade-off in large language model (LLM) inference at scale by introducing the first systematic economic analysis framework. Methodologically, it jointly models computational cost, memory and network bandwidth bottlenecks, and end-to-end latency constraints to derive the Pareto frontier between per-token cost and generation throughput across hardware configurations. The framework integrates arithmetic intensity analysis, bandwidth-aware parallelism modeling, and latency-sensitive batch-size optimization. Empirical evaluation across multiple state-of-the-art LLMs reveals universal principles governing optimal parallelization strategies and batch sizes under hardware constraints. The results provide quantifiable, decision-ready guidance for industrial LLM deployment—enabling optimal hardware selection and scheduling configuration under given budget and latency requirements.
📝 Abstract
We develop a theoretical model that addresses the economic trade-off between cost per token and serial token generation speed when deploying LLMs for inference at scale. Our model accounts for arithmetic, memory-bandwidth, network-bandwidth, and latency constraints, and optimizes over parallelism setups and batch sizes to find those that maximize serial inference speed at a given cost per token. We use the model to compute Pareto frontiers of serial speed versus cost per token for popular language models.
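The core trade-off the abstract describes can be sketched with a toy roofline-style model: each decoding step is bounded by either compute time or the time to stream the model weights through memory, so increasing batch size lowers cost per token but, once compute-bound, slows serial generation. All hardware numbers below (FLOPs, bandwidth, hourly price) are illustrative assumptions, not figures from the paper.

```python
def step_time(batch_size, params, peak_flops, mem_bw):
    """Seconds per decoding step: max of compute time and weight-streaming time."""
    weight_bytes = 2 * params                          # fp16 weights
    compute_s = 2 * params * batch_size / peak_flops   # ~2 FLOPs per param per token
    memory_s = weight_bytes / mem_bw                   # weights read once per step
    return max(compute_s, memory_s)

def speed_and_cost(batch_size, params=70e9, peak_flops=1e15,
                   mem_bw=3.35e12, usd_per_hour=4.0):
    """Serial tokens/s per request and dollar cost per generated token."""
    t = step_time(batch_size, params, peak_flops, mem_bw)
    serial_tok_per_s = 1.0 / t            # speed seen by a single request
    total_tok_per_s = batch_size / t      # aggregate throughput of the batch
    usd_per_token = usd_per_hour / 3600.0 / total_tok_per_s
    return serial_tok_per_s, usd_per_token

# Sweeping batch size traces out the speed-vs-cost frontier:
for b in (1, 8, 64, 512):
    speed, cost = speed_and_cost(b)
    print(f"batch={b:4d}  serial={speed:8.1f} tok/s  cost=${cost:.2e}/token")
```

In this sketch, small batches are memory-bandwidth-bound (serial speed is flat while cost per token falls with batch size); past the compute/memory crossover, larger batches keep reducing cost only by sacrificing serial speed, which is exactly the Pareto trade-off the model optimizes over.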