🤖 AI Summary
To address the high GPU instance costs and inaccurate virtual machine (VM) selection in cloud-based large language model (LLM) inference, this paper proposes InferSave, a cost-effective VM recommendation framework. Methodologically, InferSave jointly models hierarchical KV cache offloading and a Computation Time Calibration Function (CTCF), integrated with SLO-driven resource modeling and precise GPU memory demand estimation. Crucially, the CTCF corrects discrepancies between theoretical performance predictions and actual GPU execution latency, overcoming the accuracy limitations of conventional VM selection approaches. Evaluation on AWS demonstrates that InferSave reduces online inference costs by up to 73.7% and, when combined with KV cache offloading, achieves an additional 20.19% cost reduction for offline workloads. The framework thus enables highly cost-efficient, SLO-compliant LLM inference deployment in cloud environments.
📝 Abstract
LLM inference is essential for applications like text summarization, translation, and data analysis, but the high cost of GPU instances from Cloud Service Providers (CSPs) like AWS is a major burden. This paper proposes InferSave, a cost-efficient VM selection framework for cloud-based LLM inference. InferSave optimizes KV cache offloading based on Service Level Objectives (SLOs) and workload characteristics, estimates GPU memory needs, and recommends cost-effective VM instances. Additionally, the Compute Time Calibration Function (CTCF) improves instance selection accuracy by adjusting for discrepancies between theoretical and actual GPU performance. Experiments on AWS GPU instances show that selecting lower-cost instances without KV cache offloading improves cost efficiency by up to 73.7% for online workloads, while KV cache offloading saves up to 20.19% for offline workloads.
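To make the selection logic concrete, here is a minimal sketch of the idea: estimate GPU memory demand (weights plus KV cache), scale the theoretical compute time by a calibration factor in the spirit of the CTCF, and pick the cheapest instance that both fits in memory and meets the SLO. All names, formulas, and numbers below are illustrative assumptions, not InferSave's actual model.

```python
from dataclasses import dataclass

@dataclass
class VM:
    name: str
    gpu_mem_gb: float      # usable GPU memory
    tflops: float          # advertised peak throughput (FP16)
    price_per_hr: float    # on-demand price in USD

def kv_cache_gb(layers, heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """KV cache size: K and V tensors per layer, per token, per head."""
    elems = 2 * layers * heads * head_dim * seq_len * batch
    return elems * bytes_per_elem / 1024**3

def pick_vm(vms, weights_gb, kv_gb, flops_per_token, slo_s, tokens, ctcf=1.0):
    """Cheapest VM that fits weights + KV cache and meets the latency SLO.
    Predicted time = theoretical FLOP time scaled by a calibration factor,
    a stand-in for the paper's CTCF correction."""
    feasible = []
    for vm in vms:
        if vm.gpu_mem_gb < weights_gb + kv_gb:
            continue  # KV cache would not fit without offloading
        t_pred = ctcf * tokens * flops_per_token / (vm.tflops * 1e12)
        if t_pred <= slo_s:
            feasible.append(vm)
    return min(feasible, key=lambda v: v.price_per_hr, default=None)

# Hypothetical catalog: a 7B FP16 model (~13 GB weights) on two instance types.
vms = [VM("g5.xlarge", 24, 31.5, 1.0), VM("p4d-slice", 40, 312.0, 32.0)]
kv = kv_cache_gb(layers=32, heads=32, head_dim=128, seq_len=2048, batch=1)
choice = pick_vm(vms, weights_gb=13, kv_gb=kv,
                 flops_per_token=1.4e10, slo_s=1.0, tokens=100, ctcf=1.3)
```

With these toy numbers both instances satisfy the SLO, so the sketch returns the cheaper `g5.xlarge`; tightening `slo_s` or raising `ctcf` (a larger gap between theoretical and measured performance) pushes the choice toward the faster, costlier instance.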