Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

📅 2026-04-16

🤖 AI Summary
This work addresses a limitation of conventional prefill-decode (PD) disaggregation: KVCache transfer overhead keeps prefill and decode inside a single high-bandwidth network domain, which hinders cross-datacenter deployment, resource elasticity, and independent scaling of heterogeneous hardware. It proposes Prefill-as-a-Service (PrfaaS), an architecture that offloads long-context prefill to standalone prefill clusters in other datacenters and ships the resulting KVCache to local PD clusters over commodity Ethernet. PrfaaS co-designs model and system: hybrid-attention models that shrink the KVCache are combined with selective offloading, bandwidth-aware scheduling, and cache-aware request placement to reduce inter-cluster communication cost. In a case study on an internal 1T-parameter hybrid model, a PrfaaS-augmented heterogeneous deployment achieves 54% and 32% higher serving throughput than homogeneous PD and naive heterogeneous baselines, respectively, while consuming only modest cross-datacenter bandwidth.

📝 Abstract
Prefill-decode (PD) disaggregation has become the standard architecture for large-scale LLM serving, but in practice its deployment boundary is still determined by KVCache transfer. In conventional dense-attention models, prefill generates large volumes of KVCache traffic that keep prefill and decode tightly coupled within a single high-bandwidth network domain, limiting heterogeneous deployment and resource elasticity. Recent hybrid-attention architectures substantially reduce KVCache size, making cross-cluster KVCache transport increasingly plausible. However, a smaller KVCache alone does not make heterogeneous cross-datacenter PD serving practical: real workloads remain bursty, request lengths are highly skewed, prefix caches are unevenly distributed, and inter-cluster bandwidth fluctuates. A naive design that fully externalizes prefill can therefore still suffer from congestion, unstable queueing, and poor utilization. We present Prefill-as-a-Service (PrfaaS), a cross-datacenter serving architecture that selectively offloads long-context prefill to standalone, compute-dense prefill clusters and transfers the resulting KVCache over commodity Ethernet to local PD clusters for decode. Rather than treating reduced KVCache as sufficient, PrfaaS combines model-side KV efficiency with system-side selective offloading, bandwidth-aware scheduling, and cache-aware request placement. This design removes the requirement that heterogeneous accelerators share the same low-latency RDMA fabric, enabling independent scaling of prefill and decode capacity across loosely coupled clusters. In a case study using an internal 1T-parameter hybrid model, a PrfaaS-augmented heterogeneous deployment achieves 54% and 32% higher serving throughput than homogeneous PD and naive heterogeneous baselines, respectively, while consuming only modest cross-datacenter bandwidth.
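
To make the idea of selective, bandwidth-aware offloading concrete, here is a minimal Python sketch of one possible routing rule. It is not the paper's scheduler. The names (`Request`, `should_offload`), the thresholds, and the per-token KVCache size are all hypothetical, and the sketch simplifies by charging transfer time only for tokens not already covered by a local prefix cache.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int         # total prompt length
    cached_prefix_tokens: int  # tokens already covered by a local prefix cache


def kv_transfer_seconds(tokens: int, kv_bytes_per_token: float,
                        link_bytes_per_s: float) -> float:
    """Estimated time to ship the KVCache for `tokens` over the inter-cluster link."""
    return tokens * kv_bytes_per_token / link_bytes_per_s


def should_offload(req: Request, *, min_uncached_tokens: int,
                   kv_bytes_per_token: float, free_link_bytes_per_s: float,
                   max_transfer_s: float) -> bool:
    """Hypothetical rule: offload only long, mostly uncached prompts,
    and only when the link can carry the resulting KVCache in time."""
    uncached = req.prompt_tokens - req.cached_prefix_tokens
    # Cache-aware: short or mostly cached prompts stay on the local PD cluster.
    if uncached < min_uncached_tokens:
        return False
    # Bandwidth-aware: never offload onto a saturated link.
    if free_link_bytes_per_s <= 0:
        return False
    eta = kv_transfer_seconds(uncached, kv_bytes_per_token, free_link_bytes_per_s)
    return eta <= max_transfer_s


# Illustrative numbers only (assumptions, not measurements from the paper).
policy = dict(min_uncached_tokens=8192, kv_bytes_per_token=10_000.0,
              max_transfer_s=2.0)
long_req = Request(prompt_tokens=100_000, cached_prefix_tokens=0)
short_req = Request(prompt_tokens=2_000, cached_prefix_tokens=1_500)

print(should_offload(long_req, free_link_bytes_per_s=1e9, **policy))
print(should_offload(long_req, free_link_bytes_per_s=1e8, **policy))
print(should_offload(short_req, free_link_bytes_per_s=1e9, **policy))
```

With these made-up numbers, the long prompt is offloaded only when the link has enough spare bandwidth, and the short, mostly cached prompt always stays local. A real scheduler would also have to account for queueing at the prefill cluster and for fluctuations in link bandwidth, both of which the abstract names as practical obstacles.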
Problem

Research questions and friction points this paper is trying to address.

Prefill-decode disaggregation
KVCache transfer
cross-datacenter serving
heterogeneous deployment
large language model serving
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prefill-as-a-Service
KVCache
cross-datacenter serving
hybrid-attention
disaggregated inference