Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism

📅 2026-05-06
🤖 AI Summary
This work addresses the challenge of efficiently serving multi-tenant LLM requests with heterogeneous latency requirements (for example, interactive and background tasks) under a fixed GPU budget. The goal is to maximize goodput, meaning the throughput of requests that meet both their tail time-to-first-token (TTFT) and time-per-output-token (TPOT) service level objectives (SLOs). To this end, it treats tensor parallelism (TP) as a first-class runtime control dimension rather than a static deployment choice, and proposes a co-design of TP-aware weight reuse, fast KV cache migration, dynamic GPU partitioning between prefill and decode, and multi-tier SLO-aware scheduling. These mechanisms jointly optimize resource allocation and request scheduling across both the prefill and decode phases. Experiments on real traces and targeted microbenchmarks show up to 5.3× higher SLO-compliant goodput than state-of-the-art systems.
📝 Abstract
LLM serving is increasingly multi-tenant: the same deployment must handle latency-critical interactive requests and more relaxed background workloads under a fixed GPU budget. This creates a tiered-SLO setting where maximizing overall goodput (requests that satisfy both TTFT and TPOT targets) is challenging because workload mix, request lengths, and load intensity vary over time. Existing systems mainly optimize request-level controls (e.g., queuing and batching) while keeping execution configuration largely static, which limits adaptation under multi-tier contention. We present Nitsum, a distributed LLM serving system that treats tensor parallelism (TP) as a first-class runtime control surface rather than a static deployment choice. Nitsum jointly optimizes TP level, prefill/decode GPU split, and request scheduling. To make frequent TP adaptation practical, Nitsum introduces TP-aware weight reuse and fast KV migration. Experiments on real traces and targeted microbenchmarks show that Nitsum improves SLO-compliant goodput over state-of-the-art systems by up to 5.3×.
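The goodput metric in the abstract counts only requests that satisfy both their TTFT and TPOT targets, with targets that depend on the request's tier. The sketch below illustrates that definition only; it is not Nitsum's code. The tier names, the threshold values, and the `Slo`, `Request`, `meets_slo`, and `goodput` names are all hypothetical, and goodput is assumed here to be reported as SLO-compliant requests per second over a measurement window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Per-tier latency targets, in seconds (illustrative values only)."""
    ttft_s: float  # time-to-first-token target
    tpot_s: float  # time-per-output-token target


@dataclass(frozen=True)
class Request:
    """Measured latencies of one completed request."""
    tier: str
    ttft_s: float
    tpot_s: float


def meets_slo(req: Request, slos: dict[str, Slo]) -> bool:
    """A request counts only if it meets BOTH targets of its own tier."""
    slo = slos[req.tier]
    return req.ttft_s <= slo.ttft_s and req.tpot_s <= slo.tpot_s


def goodput(requests: list[Request], slos: dict[str, Slo], window_s: float) -> float:
    """SLO-compliant requests per second over a window of window_s seconds."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    compliant = sum(1 for r in requests if meets_slo(r, slos))
    return compliant / window_s


if __name__ == "__main__":
    # Hypothetical tiers: a strict interactive tier and a relaxed background tier.
    slos = {
        "interactive": Slo(ttft_s=0.5, tpot_s=0.05),
        "background": Slo(ttft_s=5.0, tpot_s=0.5),
    }
    trace = [
        Request("interactive", ttft_s=0.3, tpot_s=0.04),  # meets both targets
        Request("interactive", ttft_s=0.3, tpot_s=0.08),  # misses TPOT
        Request("background", ttft_s=3.0, tpot_s=0.08),   # meets both targets
        Request("background", ttft_s=6.0, tpot_s=0.08),   # misses TTFT
    ]
    print(goodput(trace, slos, window_s=2.0))
```

Note that the same measured latency can pass in one tier and fail in another: a TPOT of 0.08 s violates the interactive target above but satisfies the background one, which is why a tiered scheduler cannot treat all requests uniformly.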
Problem

Research questions and friction points this paper is trying to address.

LLM serving
tiered SLO
multi-tenant
tensor parallelism
goodput
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Tensor Parallelism
Tiered SLO
LLM Serving
KV Cache Migration
Runtime Resource Optimization