SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving

📅 2026-03-02
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the low GPU utilization and limited cross-model batching in multi-LLM serving, which stem from model-specific resource partitioning during decoding. The authors decouple the Transformer decoder into a task-specific prefill module and a shared decode module: only the prefill component is fine-tuned, while the decode module stays frozen and is shared across models. Coupled with a model-agnostic routing strategy for balanced scheduling, this enables the first cross-model sharing of decode execution in multi-LLM services and natively supports low-bit quantization (QSUN). Experiments show that, while matching the accuracy of full fine-tuning, the method achieves up to 2.0× higher throughput on a single GPU with no more than a 5% increase in per-token latency; the quantized variant accelerates inference by a further 45%.

πŸ“ Abstract
In multi-model LLM serving, decode execution remains inefficient due to model-specific resource partitioning: since cross-model batching is not possible, memory-bound decoding often suffers from severe GPU underutilization, especially under skewed workloads. We propose Shared Use of Next-token Prediction (SUN), the first approach that enables cross-model sharing of decode execution in disaggregated multi-LLM serving. SUN decomposes a decoder-only Transformer into a prefill module and a decode module, and fine-tunes only the task-specific prefill module, enabling a frozen decode module to be shared across models. This design enables a model-agnostic decode routing policy that balances decode requests across shared workers to maximize utilization. Across diverse tasks and model families, SUN achieves accuracy comparable to full fine-tuning while maintaining system throughput with fewer decode workers. In particular, SUN improves throughput per GPU by up to 2.0x over conventional disaggregation while keeping time-per-output-token (TPOT) within 5%. SUN inherently enables and facilitates low-bit decoding; with Quantized SUN (QSUN), it achieves a 45% speedup with comparable accuracy to SUN while preserving the benefits of shared decoding.
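Because the frozen decode module is shared, any decode worker can serve any model's requests, so routing reduces to pure load balancing. The sketch below illustrates one plausible model-agnostic decode routing policy (least-loaded worker selection); the class and method names are illustrative assumptions, not the paper's actual implementation.

```python
import heapq

class DecodeRouter:
    """Hypothetical model-agnostic router: since the decode module is
    shared across models, requests can go to whichever shared decode
    worker currently has the fewest active requests."""

    def __init__(self, num_workers):
        # Min-heap of (active_requests, worker_id): least-loaded first.
        self.load = [(0, w) for w in range(num_workers)]
        heapq.heapify(self.load)

    def route(self, request_id):
        """Assign a decode request to the least-loaded worker."""
        active, worker = heapq.heappop(self.load)
        heapq.heappush(self.load, (active + 1, worker))
        return worker

    def complete(self, worker):
        """Mark one request on `worker` as finished."""
        self.load = [(a - 1, w) if w == worker else (a, w)
                     for (a, w) in self.load]
        heapq.heapify(self.load)

router = DecodeRouter(num_workers=2)
assignments = [router.route(r) for r in ("req-a", "req-b", "req-c")]
```

With two idle workers, the three requests spread across both workers rather than queuing behind a single model's partition, which is the utilization gain the paper attributes to cross-model sharing.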
Problem

Research questions and friction points this paper is trying to address.

multi-LLM serving
decode inefficiency
GPU underutilization
cross-model batching
disaggregated serving
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-model sharing
disaggregated LLM serving
decode module reuse
quantized decoding
next-token prediction