🤖 AI Summary
To address low GPU utilization during the decoding phase of large language model (LLM) inference, caused by the phase's memory-bound nature and insufficient batching under variable workloads, this paper proposes a framework for safely co-locating inference with parameter-efficient fine-tuning (PEFT). The method introduces three key components: (1) a unified memory allocator that enables GPU memory sharing and reuse across workloads; (2) a two-stage latency predictor that accurately models decode latency under dynamic load; and (3) a QoS-guaranteed scheduler that maximizes fine-tuning throughput without violating inference latency service-level objectives (SLOs). Experimental results show that fine-tuning throughput increases by 46.2% on average, and by up to 92.0%, over baseline approaches, demonstrating efficient co-execution of LLM inference and PEFT under strict QoS constraints.
📝 Abstract
Large language models (LLMs) are increasingly deployed under the Model-as-a-Service (MaaS) paradigm. To meet stringent quality-of-service (QoS) requirements, existing LLM serving systems disaggregate the prefill and decode phases of inference. However, decode instances often experience low GPU utilization due to their memory-bound nature and insufficient batching in dynamic workloads, leaving compute resources underutilized.
We introduce Harli, a serving system that improves GPU utilization by co-locating parameter-efficient fine-tuning (PEFT) tasks with LLM decode instances. PEFT tasks are compute-bound and memory-efficient, making them ideal candidates for safe co-location. Harli addresses the two key challenges of co-location, limited memory and unpredictable interference, with three components: a unified memory allocator for runtime memory reuse, a two-stage latency predictor for modeling decode latency, and a QoS-guaranteed scheduler for maximizing fine-tuning throughput. Experimental results show that Harli improves fine-tuning throughput by 46.2% on average (up to 92.0%) over state-of-the-art serving systems, while maintaining strict QoS guarantees for inference decode.
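To make the scheduling idea concrete, below is a minimal illustrative sketch (not Harli's actual implementation) of a QoS-guaranteed scheduler: it uses a latency predictor to pick the largest compute quota for co-located PEFT work whose predicted decode latency still meets the SLO. The linear predictor, the 50 ms SLO, and all coefficients here are hypothetical stand-ins; Harli's two-stage predictor and scheduler are more sophisticated.

```python
SLO_MS = 50.0  # hypothetical per-step decode latency SLO


def predict_decode_latency_ms(batch_size: int, peft_quota: float) -> float:
    """Toy stand-in for a decode latency predictor.

    Assumes latency grows with the decode batch size and with the
    fraction of GPU compute granted to co-located PEFT (peft_quota in [0, 1]).
    All coefficients are invented for illustration.
    """
    base = 20.0 + 0.8 * batch_size    # decode cost under current load
    interference = 25.0 * peft_quota  # slowdown from PEFT co-location
    return base + interference


def choose_peft_quota(batch_size: int, steps: int = 20) -> float:
    """Return the largest PEFT compute quota (searched in 1/steps increments)
    whose predicted decode latency stays within the SLO."""
    best = 0.0
    for i in range(steps + 1):
        quota = i / steps
        if predict_decode_latency_ms(batch_size, quota) <= SLO_MS:
            best = quota
    return best


# e.g. choose_peft_quota(8) -> 0.9 : a lightly loaded decode instance
# leaves most compute for PEFT, while choose_peft_quota(32) -> 0.15 :
# a busy instance throttles PEFT to protect the SLO.
```

The design point this sketch illustrates is that the scheduler never admits PEFT work speculatively: every quota increase is gated on the predictor certifying that the decode SLO still holds.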