🤖 AI Summary
To address low GPU utilization during the decoding phase of large language model (LLM) inference, caused by the phase's memory-bound nature and insufficient batching under variable workloads, this paper proposes a framework for safely co-locating inference with parameter-efficient fine-tuning (PEFT). The method introduces three key components: (1) a unified memory allocator that enables GPU memory sharing and reuse across workloads; (2) a two-stage latency predictor that accurately models decode latency under dynamic load; and (3) a QoS-guaranteed scheduler that maximizes fine-tuning throughput without violating inference latency service-level objectives (SLOs). Experimental results show that fine-tuning throughput increases by 46.2% on average, and by up to 92.0%, over baseline approaches, demonstrating efficient co-execution of LLM inference and PEFT under strict QoS constraints.
📝 Abstract
Large language models (LLMs) are increasingly deployed under the Model-as-a-Service (MaaS) paradigm. To meet stringent quality-of-service (QoS) requirements, existing LLM serving systems disaggregate the prefill and decode phases of inference. However, decode instances often experience low GPU utilization due to their memory-bound nature and insufficient batching in dynamic workloads, leaving compute resources underutilized.
We introduce Harli, a serving system that improves GPU utilization by co-locating parameter-efficient fine-tuning (PEFT) tasks with LLM decode instances. PEFT tasks are compute-bound and memory-efficient, making them ideal candidates for safe co-location. Harli addresses the two key challenges of co-location, limited memory and unpredictable interference, with three components: a unified memory allocator for runtime memory reuse, a two-stage latency predictor for modeling decode latency, and a QoS-guaranteed scheduler for maximizing fine-tuning throughput. Experimental results show that Harli improves fine-tuning throughput by 46.2% on average (up to 92.0%) over state-of-the-art serving systems, while maintaining strict QoS guarantees for inference decode.
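To make the scheduling idea concrete, below is a minimal illustrative sketch (not Harli's actual implementation) of a QoS-guaranteed scheduler: it uses a latency predictor to pick the largest compute quota for co-located PEFT work whose predicted decode latency still meets the SLO. The linear predictor, the 50 ms SLO, and all coefficients here are hypothetical stand-ins; Harli's two-stage predictor and scheduler are more sophisticated.

```python
SLO_MS = 50.0  # hypothetical per-step decode latency SLO


def predict_decode_latency_ms(batch_size: int, peft_quota: float) -> float:
    """Toy stand-in for a decode latency predictor.

    Assumes latency grows with the decode batch size and with the
    fraction of GPU compute granted to co-located PEFT (peft_quota in [0, 1]).
    All coefficients are invented for illustration.
    """
    base = 20.0 + 0.8 * batch_size    # decode cost under current load
    interference = 25.0 * peft_quota  # slowdown from PEFT co-location
    return base + interference


def choose_peft_quota(batch_size: int, steps: int = 20) -> float:
    """Return the largest PEFT compute quota (searched in 1/steps increments)
    whose predicted decode latency stays within the SLO."""
    best = 0.0
    for i in range(steps + 1):
        quota = i / steps
        if predict_decode_latency_ms(batch_size, quota) <= SLO_MS:
            best = quota
    return best


# e.g. choose_peft_quota(8) -> 0.9 : a lightly loaded decode instance
# leaves most compute for PEFT, while choose_peft_quota(32) -> 0.15 :
# a busy instance throttles PEFT to protect the SLO.
```

The design point this sketch illustrates is that the scheduler never admits PEFT work speculatively: every quota increase is gated on the predictor certifying that the decode SLO still holds.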