PrefillShare: A Shared Prefill Module for KV Reuse in Multi-LLM Disaggregated Serving

πŸ“… 2026-02-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the computational redundancy and increased tail latency of multi-model agent systems, in which each language model independently performs prefill and maintains its own KV cache. To remove this inefficiency, the authors decouple each model into a frozen, shared prefill module and a tunable, task-specific decode module. Under a disaggregated deployment architecture, this design enables cross-model reuse of prefill computation and KV caches. They further introduce a routing mechanism for heterogeneous models built on vLLM. The approach achieves, for the first time, efficient sharing of the prefill stage across multiple models in serving scenarios, matching the accuracy of full fine-tuning while delivering 4.5Γ— lower p95 latency and 3.9Γ— higher throughput.
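To make the decoupling concrete, here is a minimal, self-contained sketch (not the authors' code) of the core idea: a frozen shared prefill module computes the KV cache for a prompt prefix once, and several task-specific, independently fine-tuned decode modules attend over that same cache. All class and function names (SharedPrefill, DecodeHead, prefill_once, kv_cache_store) are illustrative assumptions.

```python
# Minimal sketch of the shared-prefill idea (not the authors' implementation).
# Assumed, illustrative names: SharedPrefill, DecodeHead, prefill_once, kv_cache_store.
import hashlib
from typing import Dict, Tuple

import torch
import torch.nn as nn


class SharedPrefill(nn.Module):
    """Frozen module shared by all task models; it emits the KV cache for a prompt."""

    def __init__(self, d_model: int = 64, vocab: int = 1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        for p in self.parameters():  # frozen: never touched during task fine-tuning
            p.requires_grad_(False)

    @torch.no_grad()
    def forward(self, token_ids: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        h = self.embed(token_ids)
        return self.k_proj(h), self.v_proj(h)


class DecodeHead(nn.Module):
    """Task-specific, fine-tunable decode module that attends over a shared KV cache."""

    def __init__(self, d_model: int = 64, vocab: int = 1000):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, query_state: torch.Tensor,
                kv: Tuple[torch.Tensor, torch.Tensor]) -> torch.Tensor:
        k, v = kv
        q = self.q_proj(query_state)
        attn = torch.softmax(q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5, dim=-1)
        return self.lm_head(attn @ v)


kv_cache_store: Dict[str, Tuple[torch.Tensor, torch.Tensor]] = {}


def prefill_once(prefill: SharedPrefill, token_ids: torch.Tensor):
    """Run the shared prefill for a prompt prefix only if no model has cached it yet."""
    key = hashlib.sha1(str(token_ids.tolist()).encode()).hexdigest()
    if key not in kv_cache_store:
        kv_cache_store[key] = prefill(token_ids)
    return kv_cache_store[key]


# Two task models reuse one prefill pass and one KV cache for the same prompt prefix.
prefill = SharedPrefill()
summarizer, planner = DecodeHead(), DecodeHead()
prompt = torch.randint(0, 1000, (1, 16))
kv = prefill_once(prefill, prompt)   # executed once, shared afterwards
query = torch.randn(1, 1, 64)
logits_a = summarizer(query, kv)     # both decode modules attend over the same cache
logits_b = planner(query, kv)
```

Freezing the prefill side is what makes the cache reusable: because every task model sees identical prefill weights, the keys and values computed for a shared prefix are valid for all of them.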

πŸ“ Abstract
Multi-agent systems increasingly orchestrate multiple specialized language models to solve complex real-world problems, often invoking them over a shared context. This execution pattern repeatedly processes the same prompt prefix across models. Consequently, each model redundantly executes the prefill stage and maintains its own key-value (KV) cache, increasing aggregate prefill load and worsening tail latency by intensifying prefill-decode interference in existing LLM serving stacks. Disaggregated serving reduces such interference by placing prefill and decode on separate GPUs, but disaggregation does not fundamentally eliminate inter-model redundancy in computation and KV storage for the same prompt. To address this issue, we propose PrefillShare, a novel algorithm that enables sharing the prefill stage across multiple models in a disaggregated setting. PrefillShare factorizes the model into prefill and decode modules, freezes the prefill module, and fine-tunes only the decode module. This design allows multiple task-specific models to share a prefill module and the KV cache generated for the same prompt. We further introduce a routing mechanism that enables effective prefill sharing across heterogeneous models in a vLLM-based disaggregated system. PrefillShare not only matches full fine-tuning accuracy on a broad range of tasks and models, but also delivers 4.5x lower p95 latency and 3.9x higher throughput in multi-model agent workloads.
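The abstract also mentions a routing mechanism that enables prefill sharing across heterogeneous models in a vLLM-based disaggregated system. The sketch below is an assumption-laden illustration of that routing idea, not the paper's vLLM integration: requests from different task models that share a prompt prefix are grouped, the prefix is prefilled once on a shared prefill worker, and each per-model decode worker receives only a handle to the resulting KV cache. Names such as Router, PrefillWorker, and DecodeWorker are hypothetical.

```python
# Assumed, illustrative routing sketch; this is not the paper's vLLM integration.
# Requests that share a prompt prefix are grouped, the prefix is prefilled once on a
# shared prefill worker, and per-model decode workers receive only a KV-cache handle.
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    model: str   # which task-specific decode model should answer
    prefix: str  # shared prompt prefix (e.g., the agents' common context)
    suffix: str  # model-specific instruction appended to the prefix


class PrefillWorker:
    """Stands in for the shared prefill GPU; one cache handle per unique prefix."""

    def __init__(self):
        self._cache: Dict[str, str] = {}

    def prefill(self, prefix: str) -> str:
        if prefix not in self._cache:                          # prefill runs once per prefix
            self._cache[prefix] = f"kv://{abs(hash(prefix)):x}"
        return self._cache[prefix]


class DecodeWorker:
    """Stands in for a per-model decode GPU; consumes the shared KV-cache handle."""

    def __init__(self, model: str):
        self.model = model

    def decode(self, kv_handle: str, suffix: str) -> str:
        return f"[{self.model}] decoded '{suffix}' using {kv_handle}"


class Router:
    def __init__(self, prefill: PrefillWorker, decoders: Dict[str, DecodeWorker]):
        self.prefill, self.decoders = prefill, decoders

    def dispatch(self, requests: List[Request]) -> List[str]:
        by_prefix: Dict[str, List[Request]] = defaultdict(list)
        for r in requests:
            by_prefix[r.prefix].append(r)                      # group by shared prefix
        outputs = []
        for prefix, group in by_prefix.items():
            handle = self.prefill.prefill(prefix)              # one prefill per prefix
            for r in group:
                outputs.append(self.decoders[r.model].decode(handle, r.suffix))
        return outputs


router = Router(PrefillWorker(),
                {"planner": DecodeWorker("planner"), "coder": DecodeWorker("coder")})
shared_ctx = "Project spec and conversation history..."
print(router.dispatch([Request("planner", shared_ctx, "Draft a plan."),
                       Request("coder", shared_ctx, "Write the code.")]))
```

In this sketch only the cache handle crosses the prefill-decode boundary, mirroring how disaggregated serving transfers KV state rather than recomputing it on every decode GPU.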
Problem

Research questions and friction points this paper is trying to address.

KV cache reuse, multi-LLM serving, prefill redundancy, disaggregated serving, tail latency

Innovation

Methods, ideas, or system contributions that make the work stand out.

PrefillShare, KV cache reuse, disaggregated serving, multi-LLM systems, prefill-decode decoupling

Sunghyeon Woo
Seoul National University
Deep learning, Neuromorphic algorithm, Memory efficient training, Activation compression
Hoseung Kim
NAVER Cloud
Sunghwan Shim
NAVER Cloud
Minjung Jo
NAVER Cloud
Hyunjoon Jeong
NAVER Cloud
Jeongtae Lee
NAVER Cloud
Joonghoon Kim
NAVER Cloud
Sungjae Lee
NAVER Cloud
Baeseong Park
NAVER Cloud
Se Jung Kwon
NAVER Cloud
Deep Learning, DNN Compression, Discrete Event Modeling and Simulation
Dongsoo Lee
NAVER Cloud
Model compression, Optimization, AI Chip Design