🤖 AI Summary
To address the high GPU instance costs and inaccurate virtual machine (VM) selection in cloud-based large language model (LLM) inference, this paper proposes InferSave, a cost-effective VM recommendation framework. Methodologically, InferSave jointly models hierarchical KV cache offloading and a Computation Time Calibration Function (CTCF), integrated with SLO-driven resource modeling and precise GPU memory demand estimation. Crucially, the CTCF corrects discrepancies between theoretical performance predictions and actual GPU execution latency, overcoming the accuracy limitations of conventional VM selection approaches. Evaluation on AWS demonstrates that InferSave reduces online inference costs by up to 73.7% and, when combined with KV cache offloading, achieves an additional 20.19% cost reduction for offline workloads. The framework thus enables highly cost-efficient, SLO-compliant LLM inference deployment in cloud environments.
📝 Abstract
LLM inference is essential for applications like text summarization, translation, and data analysis, but the high cost of GPU instances from Cloud Service Providers (CSPs) like AWS is a major burden. This paper proposes InferSave, a cost-efficient VM selection framework for cloud-based LLM inference. InferSave optimizes KV cache offloading based on Service Level Objectives (SLOs) and workload characteristics, estimates GPU memory needs, and recommends cost-effective VM instances. Additionally, the Compute Time Calibration Function (CTCF) improves instance selection accuracy by adjusting for discrepancies between theoretical and actual GPU performance. Experiments on AWS GPU instances show that selecting lower-cost instances without KV cache offloading improves cost efficiency by up to 73.7% for online workloads, while KV cache offloading saves up to 20.19% for offline workloads.
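To make the selection logic concrete, here is a minimal sketch of the idea: estimate GPU memory demand (weights plus KV cache), scale the theoretical compute time by a calibration factor in the spirit of the CTCF, and pick the cheapest instance that both fits in memory and meets the SLO. All names, formulas, and numbers below are illustrative assumptions, not InferSave's actual model.

```python
from dataclasses import dataclass

@dataclass
class VM:
    name: str
    gpu_mem_gb: float      # usable GPU memory
    tflops: float          # advertised peak throughput (FP16)
    price_per_hr: float    # on-demand price in USD

def kv_cache_gb(layers, heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """KV cache size: K and V tensors per layer, per token, per head."""
    elems = 2 * layers * heads * head_dim * seq_len * batch
    return elems * bytes_per_elem / 1024**3

def pick_vm(vms, weights_gb, kv_gb, flops_per_token, slo_s, tokens, ctcf=1.0):
    """Cheapest VM that fits weights + KV cache and meets the latency SLO.
    Predicted time = theoretical FLOP time scaled by a calibration factor,
    a stand-in for the paper's CTCF correction."""
    feasible = []
    for vm in vms:
        if vm.gpu_mem_gb < weights_gb + kv_gb:
            continue  # KV cache would not fit without offloading
        t_pred = ctcf * tokens * flops_per_token / (vm.tflops * 1e12)
        if t_pred <= slo_s:
            feasible.append(vm)
    return min(feasible, key=lambda v: v.price_per_hr, default=None)

# Hypothetical catalog: a 7B FP16 model (~13 GB weights) on two instance types.
vms = [VM("g5.xlarge", 24, 31.5, 1.0), VM("p4d-slice", 40, 312.0, 32.0)]
kv = kv_cache_gb(layers=32, heads=32, head_dim=128, seq_len=2048, batch=1)
choice = pick_vm(vms, weights_gb=13, kv_gb=kv,
                 flops_per_token=1.4e10, slo_s=1.0, tokens=100, ctcf=1.3)
```

With these toy numbers both instances satisfy the SLO, so the sketch returns the cheaper `g5.xlarge`; tightening `slo_s` or raising `ctcf` (a larger gap between theoretical and measured performance) pushes the choice toward the faster, costlier instance.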