🤖 AI Summary
LLM inference is constrained by GPU memory capacity: bursty request arrivals can deplete GPU memory, causing contention among concurrent prompts and increased latency. Existing serving engines handle memory pressure with admission control, which leaves them unresponsive during request bursts, while naively preempting prompt inference incurs a high paging overhead that degrades throughput. This paper presents Aqua, a GPU memory management framework that combines time-slice preemptive scheduling of prompts with low-overhead paging of inference state, enabling fine-grained preemption without sacrificing throughput. Evaluated on servers with 8 Nvidia H100 80 GB GPUs hosting state-of-the-art generative models of several modalities, Aqua improves the responsiveness of LLM inference by 20X over the state of the art and improves inference throughput over a single long prompt by 4X, significantly enhancing service efficiency under high load.
📝 Abstract
Inference on large language models (LLMs) is constrained by GPU memory capacity. A sudden increase in the number of inference requests to a cloud-hosted LLM can deplete GPU memory, leading to contention among multiple prompts for limited resources. Modern LLM serving engines deal with the challenge of limited GPU memory using admission control, which causes them to be unresponsive during request bursts. We propose that preemptive scheduling of prompts in time slices is essential for ensuring responsive LLM inference, especially under conditions of high load and limited GPU memory. However, preempting prompt inference incurs a high paging overhead, which reduces inference throughput. We present Aqua, a GPU memory management framework that significantly reduces the overhead of paging inference state, achieving both responsive and high-throughput inference even under bursty request patterns. We evaluate Aqua by hosting several state-of-the-art large generative ML models of different modalities on servers with 8 Nvidia H100 80 GB GPUs. Aqua improves the responsiveness of LLM inference by 20X compared to the state of the art and improves LLM inference throughput over a single long prompt by 4X.
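The core idea in the abstract, preempting prompt inference in time slices and paging each preempted prompt's inference state (e.g. its KV cache) out of GPU memory, can be illustrated with a toy round-robin scheduler. This is a minimal sketch with hypothetical names, not Aqua's actual implementation; paging is modeled as simple counters standing in for real GPU↔host transfers:

```python
from collections import deque

class Prompt:
    """A toy inference request: tokens_left decrements as it is scheduled."""
    def __init__(self, name, tokens_left):
        self.name = name
        self.tokens_left = tokens_left
        self.state_on_gpu = False  # whether its inference state is GPU-resident

class TimeSliceScheduler:
    """Round-robin, time-sliced preemptive scheduling of prompts.

    A prompt runs for at most `slice_len` steps; if unfinished, its
    inference state is paged out to host memory and it is requeued.
    """
    def __init__(self, slice_len):
        self.slice_len = slice_len
        self.queue = deque()
        self.page_ins = 0   # counts state transfers host -> GPU
        self.page_outs = 0  # counts state transfers GPU -> host

    def submit(self, prompt):
        self.queue.append(prompt)

    def run(self):
        order = []  # completion order, for inspection
        while self.queue:
            p = self.queue.popleft()
            if not p.state_on_gpu:       # page inference state back in
                p.state_on_gpu = True
                self.page_ins += 1
            steps = min(self.slice_len, p.tokens_left)
            p.tokens_left -= steps       # run one time slice
            if p.tokens_left > 0:        # preempt: page state out, requeue
                p.state_on_gpu = False
                self.page_outs += 1
                self.queue.append(p)
            else:
                order.append(p.name)
        return order

sched = TimeSliceScheduler(slice_len=4)
sched.submit(Prompt("short", 3))
sched.submit(Prompt("long", 10))
done = sched.run()  # "short" completes in its first slice; "long" is
                    # preempted twice, each preemption costing a page-out
```

The page-in/page-out counters make the trade-off in the abstract concrete: shorter slices improve responsiveness for newly arriving prompts but multiply paging traffic, which is exactly the overhead Aqua's memory management is designed to reduce.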