Valve: Production Online-Offline Inference Colocation with Jointly-Bounded Preemption Latency and Rate

📅 2026-04-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses low GPU utilization in large-model inference services, which over-provision capacity to absorb traffic bursts. Conventional online-offline colocation could reclaim that idle capacity, but it suffers from frequent, high-latency preemptions and requires complex system modifications. The authors propose Valve, a system that colocates online and offline workloads by jointly bounding preemption latency and preemption rate while preserving online task performance. Valve is the first to support, in production, at most one sub-millisecond compute preemption per online request together with rate-limited memory reclamation. It builds a lightweight GPU runtime from channel-controlled compute isolation, page-fault-free memory reclamation, and dynamic memory reservation, requiring minimal modifications to drivers and frameworks. Deployed on a cluster of 8,054 GPUs, Valve improves utilization by 34.6%, saving 2,170 GPUs, while increasing online tasks’ time-to-first-token (TTFT) by less than 5% and time-per-output-token (TPOT) by less than 2%.

📝 Abstract

LLM inference now powers latency-critical production services. The bursty nature of inference traffic leads to over-provisioning, which in turn leads to resource underutilization. While online-offline colocation promises to utilize idle capacity, broad production deployment must overcome two major challenges: (i) large online interference due to slow or frequent preemptions, and (ii) extensive framework and driver modifications to colocate different models and support preemptions. We present Valve, a production-friendly colocation system that jointly bounds preemption latency and preemption rate. Specifically, Valve enables sub-millisecond compute preemption at most once per online request, and rate-limited sub-layer memory reclamation. These guarantees are provided by a GPU runtime that combines channel-controlled compute isolation, page-fault-free memory reclamation, and dynamic memory reservation. Critically, Valve is practical to deploy, requiring one line of driver modification and a 20-line framework patch. Deployed on 8,054 GPUs in production, Valve improves cluster utilization by 34.6%, which translates to a saving of 2,170 GPUs. This efficiency gain is achieved with minimal online interference, incurring a <5% TTFT increase and a <2% TPOT increase across workloads.
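
The two bounds named in the abstract (compute preemption at most once per online request, and rate-limited memory reclamation) can be illustrated with a small policy sketch. This is not Valve's runtime, which operates at the GPU driver and channel level; the class `PreemptionBudget`, its method names, and the token-bucket rate limiter are all assumptions chosen only to make the idea of jointly bounding preemption count and preemption rate concrete.

```python
from dataclasses import dataclass, field


@dataclass
class PreemptionBudget:
    """Hypothetical policy object illustrating the two bounds; not Valve's code."""

    max_reclaims_per_sec: float  # assumed rate limit for memory reclamation
    burst: float = 1.0           # assumed token-bucket capacity
    _tokens: float = field(default=0.0, init=False)
    _last: float = field(default=0.0, init=False)
    _preempted: set = field(default_factory=set, init=False)

    def __post_init__(self) -> None:
        # Start with a full bucket so the first reclamation is allowed.
        self._tokens = self.burst

    def allow_compute_preemption(self, request_id: str) -> bool:
        """Permit a compute preemption only once per online request."""
        if request_id in self._preempted:
            return False
        self._preempted.add(request_id)
        return True

    def allow_memory_reclaim(self, now: float) -> bool:
        """Token bucket: refill by elapsed time, spend one token per reclaim."""
        elapsed = max(0.0, now - self._last)
        self._last = now
        self._tokens = min(self.burst,
                           self._tokens + elapsed * self.max_reclaims_per_sec)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    budget = PreemptionBudget(max_reclaims_per_sec=2.0)
    # A second preemption attempt for the same request is refused.
    print(budget.allow_compute_preemption("req-1"))
    print(budget.allow_compute_preemption("req-1"))
    # Reclaims arriving faster than the configured rate are refused.
    print(budget.allow_memory_reclaim(now=0.0))
    print(budget.allow_memory_reclaim(now=0.25))
```

Timestamps are passed in explicitly rather than read from a clock so the behavior is deterministic; a real system would enforce these limits much closer to the hardware.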

Problem

Research questions and friction points this paper is trying to address.

online-offline colocation

preemption latency

resource underutilization

LLM inference

production deployment

Innovation

Methods, ideas, or system contributions that make the work stand out.

online-offline colocation

preemption latency

preemption rate