ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving

📅 2024-10-02
🏛️ arXiv.org
📈 Citations: 7
Influential: 0
🤖 AI Summary
LLM online serving suffers from low GPU utilization because resources sit idle under volatile workloads, while serving systems must simultaneously meet stringent latency requirements for interactive requests (e.g., chat) and high throughput demands for offline jobs (e.g., summarization). This paper proposes the first LLM execution engine to support fine-grained preemption. It combines a lightweight incremental checkpointing mechanism with an adaptive offline batch-scheduling policy, enabling millisecond-level reuse of GPU idle cycles under strong performance isolation. Compared to state-of-the-art online serving systems, it achieves 2.35× higher throughput; relative to existing co-location approaches, it reduces tail latency by 84× and significantly raises average GPU utilization under peak load. This work is the first to enable highly elastic, strongly isolated co-serving of online and offline LLM workloads under strict latency constraints.

📝 Abstract
Many applications are leveraging large language models (LLMs) for complex tasks, and they generally demand low inference latency and high serving throughput for interactive online jobs such as chatbots. However, the tight latency requirement and high load variance of applications pose challenges to serving systems in achieving high GPU utilization. Due to the high costs of scheduling and preemption, today's systems generally use separate clusters to serve online and offline inference tasks, and dedicate GPUs for online inferences to avoid interference. This approach leads to underutilized GPUs because one must reserve enough GPU resources for the peak expected load, even if the average load is low. This paper proposes to harvest stranded GPU resources for offline LLM inference tasks such as document summarization and LLM benchmarking. Unlike online inferences, these tasks usually run in a batch-processing manner with loose latency requirements, making them a good fit for stranded resources that are only available shortly. To enable safe and efficient GPU harvesting without interfering with online tasks, we built ConServe, an LLM serving system that contains (1) an execution engine that preempts running offline tasks upon the arrival of online tasks, (2) an incremental checkpointing mechanism that minimizes the amount of recomputation required by preemptions, and (3) a scheduler that adaptively batches offline tasks for higher GPU utilization. Our evaluation demonstrates that ConServe achieves strong performance isolation when co-serving online and offline tasks but at a much higher GPU utilization. When colocating practical online and offline workloads on popular models such as Llama-2-7B, ConServe achieves 2.35× higher throughput than state-of-the-art online serving systems and reduces serving latency by 84× compared to existing co-serving systems.
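The scheduling idea in component (3) above can be sketched in a few lines: online requests always run first, and offline work is batched to fill whatever GPU cycles remain. This is a minimal illustrative sketch, not ConServe's actual scheduler; the function and queue names are hypothetical.

```python
from collections import deque

def schedule_step(online_q, offline_q, max_offline_batch):
    """Pick the work for one engine iteration.

    Online requests preempt offline work immediately; otherwise,
    offline requests are batched (up to max_offline_batch) to keep
    the GPU busy during idle cycles.
    """
    if online_q:
        # Latency-critical request: serve it right away, alone.
        return ("online", [online_q.popleft()])
    batch = []
    while offline_q and len(batch) < max_offline_batch:
        batch.append(offline_q.popleft())
    return ("offline", batch) if batch else ("idle", [])
```

In a real system `max_offline_batch` would itself adapt to the observed online load, which is the "adaptive" part of the paper's scheduler.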
Problem

Research questions and friction points this paper is trying to address.

Achieving high GPU utilization under variable LLM serving loads
Co-serving latency-critical online and latency-tolerant offline tasks efficiently
Harvesting millisecond-level GPU idle cycles without violating latency objectives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-level scheduler for latency-aware batching
Layer-wise preemption for offline task yielding
Incremental KV cache management for zero-cost preemption
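The last two mechanisms above fit together naturally: checking for preemption at layer boundaries means the KV cache entries already computed form a checkpoint, so a resumed task only recomputes from the layer where it yielded. A minimal sketch under assumed names (`OfflineTask`, `run_forward` are illustrative, not ConServe's API):

```python
class OfflineTask:
    """Toy model of an offline forward pass that can yield between layers."""

    def __init__(self, num_layers):
        self.num_layers = num_layers
        self.kv_cache = {}       # layer index -> KV entry (placeholder here)
        self.resume_layer = 0    # first layer still to compute

    def run_forward(self, preempt_requested):
        """Run layer by layer, yielding the GPU at a layer boundary
        when an online request arrives."""
        for layer in range(self.resume_layer, self.num_layers):
            if preempt_requested():
                # Incremental checkpoint: KV entries for layers < `layer`
                # are already cached, so resuming skips straight here.
                self.resume_layer = layer
                return "preempted"
            self.kv_cache[layer] = f"kv[{layer}]"  # stand-in for attention KV
        self.resume_layer = self.num_layers
        return "done"
```

For example, a task preempted after two of four layers keeps `kv[0]` and `kv[1]` and, on resume, computes only layers 2 and 3 rather than restarting the whole pass.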