Cost-Efficient LLM Serving in the Cloud: VM Selection with KV Cache Offloading

πŸ“… 2025-04-16
πŸ€– AI Summary
To address the high GPU instance costs and inaccurate virtual machine (VM) selection in cloud-based large language model (LLM) inference, this paper proposes InferSave, a cost-effective VM recommendation framework. Methodologically, InferSave jointly models hierarchical KV cache offloading and a Compute Time Calibration Function (CTCF), integrated with SLO-driven resource modeling and precise GPU memory demand estimation. Crucially, the CTCF corrects discrepancies between theoretical performance predictions and actual GPU execution latency, overcoming the accuracy limitations of conventional VM selection approaches. Evaluation on AWS shows that selecting lower-cost instances without KV cache offloading improves cost efficiency by up to 73.7% for online workloads, while KV cache offloading saves up to 20.19% for offline workloads. The framework thus enables highly cost-efficient, SLO-compliant LLM inference deployment in cloud environments.
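The GPU memory demand estimation above centers on the size of the KV cache. As a rough illustration only (the paper's exact estimation model is not reproduced here), the standard transformer KV cache footprint follows from the model configuration; the numbers below are a hypothetical Llama-2-7B-like setup:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV cache size: one K and one V tensor per layer,
    each of shape [batch, heads, seq_len, head_dim]."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Illustrative config: 32 layers, 32 KV heads, head dim 128,
# 2048-token context, batch of 8, FP16 (2 bytes/element)
gib = kv_cache_bytes(32, 32, 128, 2048, 8) / 2**30  # 8.0 GiB
```

A cache of this size is what motivates offloading: when the SLO permits, part of it can live in host memory instead of driving the selection toward a larger, pricier GPU instance.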

πŸ“ Abstract
LLM inference is essential for applications like text summarization, translation, and data analysis, but the high cost of GPU instances from Cloud Service Providers (CSPs) like AWS is a major burden. This paper proposes InferSave, a cost-efficient VM selection framework for cloud-based LLM inference. InferSave optimizes KV cache offloading based on Service Level Objectives (SLOs) and workload characteristics, estimates GPU memory needs, and recommends cost-effective VM instances. Additionally, the Compute Time Calibration Function (CTCF) improves instance selection accuracy by adjusting for discrepancies between theoretical and actual GPU performance. Experiments on AWS GPU instances show that selecting lower-cost instances without KV cache offloading improves cost efficiency by up to 73.7% for online workloads, while KV cache offloading saves up to 20.19% for offline workloads.
Problem

Research questions and friction points this paper is trying to address.

Optimize VM selection for cost-efficient LLM cloud serving
Reduce GPU memory costs via KV cache offloading strategy
Improve instance selection accuracy with performance calibration
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache offloading optimizes GPU memory usage
Compute Time Calibration enhances VM selection accuracy
Cost-efficient VM selection reduces cloud expenses
Authors

Kihyun Kim
Dept. of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea
Jinwoo Kim
Dept. of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea
Hyunsun Chung
Dept. of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea
Myung-Hoon Cha
ETRI, Daejeon, Republic of Korea
Hong-Yeon Kim
ETRI, Daejeon, Republic of Korea
Youngjae Kim
Professor, Department of Computer Science and Engineering, Sogang University
Research areas: Operating System, File and Storage System, Distributed System