RTP-LLM: High-Performance Alibaba LLM Inference Engine

📅 2026-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses critical bottlenecks in deploying large language models (LLMs) in industrial settings (inefficient inference, slow model loading, high latency, and suboptimal resource utilization) by introducing a unified, high-efficiency inference engine. The proposed system accelerates model loading through sequential I/O optimization and by overlapping I/O with communication. It also incorporates a decoupled Prefill-Decode architecture, hierarchical KV cache reuse, modular speculative decoding, adaptive quantization, multimodal decoupling, and multi-level parallelism. Evaluated on models ranging from 8B to 235B parameters, it is reported to outperform vLLM and SGLang: model loading is 4.7–6.3× faster; in production traffic scheduling, P95 time-to-first-token (TTFT) latency drops by 35–37% and KV cache reuse improves by 215%; throughput improves by 1.12–2.48× for speculative decoding and by 1.86–2.52× for multimodal inference; and in quantized inference, batch latency drops by 35–40% while TTFT improves by 1.9–3.0×.
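To make the speculative decoding idea mentioned above concrete, the following is a minimal sketch of one greedy draft-and-verify step. It illustrates the general technique only and is not RTP-LLM's implementation; `speculative_step`, `toy_target`, and `toy_draft` are hypothetical names, and the two "models" are toy functions that map a token prefix to a next token.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Propose k draft tokens, keep the longest run the target agrees with,
    then append one token chosen by the target itself."""
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    accepted: List[int] = []
    for tok in proposed:
        # A real engine would score all k positions in one batched target pass;
        # this sketch calls the target once per position for clarity.
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # the target's correction ends the step
            return accepted
        accepted.append(tok)

    # Every draft token was accepted, so the target contributes a bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


def toy_target(tokens: List[int]) -> int:
    """Toy target model (assumption): counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    """Toy draft model (assumption): agrees with the target except after token 3."""
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


partial = speculative_step([1], toy_draft, toy_target, k=4)  # [2, 3, 4]
full = speculative_step([5], toy_draft, toy_target, k=2)     # [6, 7, 8]
```

With greedy verification the output always matches what the target alone would have produced; the gain comes from emitting several tokens per target pass whenever the draft agrees.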
📝 Abstract
Large Language Models (LLMs) have revolutionized AI applications, but deploying them at scale presents significant challenges. We present RTP-LLM, a high-performance inference engine for industrial-scale LLM deployment, successfully deployed across Alibaba Group serving over 100 million users. RTP-LLM addresses fundamental bottlenecks through integrated design. It optimizes model loading via file-order-driven I/O and parallel I/O-communication overlapping. The Prefill-Decode Disaggregation architecture decouples compute-intensive prefill from memory-bound decode phases, combined with hierarchical multi-tiered KV cache management enabling efficient cache reuse. In addition, RTP-LLM incorporates modular speculative decoding supporting multiple algorithms, adaptive KV cache quantization, and decoupled multimodal processing, with support for multi-level parallelism. Comprehensive evaluations across diverse model architectures (8B-235B parameters) have been conducted, where both controlled benchmarks and real production workloads are used. The results demonstrate RTP-LLM's superior performance against vLLM and SGLang: 4.7x-6.3x model loading speedup, 35-37% TTFT P95 latency reduction with 215% cache reuse improvement in production traffic scheduling, 1.12x-2.48x and 1.86x-2.52x throughput improvements in speculative decoding and multimodal inference, respectively, and 35-40% batch latency reduction with 1.9x-3.0x TTFT improvement in quantized inference. RTP-LLM's production-proven architecture and open-source availability make it a comprehensive solution for industrial LLM deployment.
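The abstract does not describe how the hierarchical KV cache is organized, so the following is only a minimal sketch of one common approach: prefix reuse over chain-hashed token blocks, with an LRU fast tier that demotes evicted blocks to a larger slow tier. All names (`BLOCK`, `block_keys`, `TieredPrefixCache`, `admit`) and the tier sizes are assumptions for illustration, not RTP-LLM's API.

```python
import hashlib
from collections import OrderedDict
from typing import List, Tuple

BLOCK = 4  # tokens per KV block (illustrative value, not from the paper)


def block_keys(tokens: List[int]) -> List[str]:
    """Chain-hash each full block so a key identifies the whole prefix up to it."""
    keys: List[str] = []
    prev = ""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        chunk = ",".join(map(str, tokens[i:i + BLOCK]))
        prev = hashlib.sha256((prev + "|" + chunk).encode()).hexdigest()
        keys.append(prev)
    return keys


class TieredPrefixCache:
    """Two LRU tiers: a small fast tier (think GPU memory) and a larger
    slow tier (think host memory). Only block keys are tracked here."""

    def __init__(self, fast_blocks: int, slow_blocks: int) -> None:
        self.fast: OrderedDict = OrderedDict()
        self.slow: OrderedDict = OrderedDict()
        self.fast_cap = fast_blocks
        self.slow_cap = slow_blocks

    def _touch(self, key: str) -> None:
        self.slow.pop(key, None)  # promote if the block lived in the slow tier
        self.fast[key] = None
        self.fast.move_to_end(key)
        while len(self.fast) > self.fast_cap:
            old, _ = self.fast.popitem(last=False)
            self.slow[old] = None  # demote instead of dropping
            while len(self.slow) > self.slow_cap:
                self.slow.popitem(last=False)

    def admit(self, tokens: List[int]) -> Tuple[int, int]:
        """Return (tokens reused, tokens still to prefill), then cache the request."""
        keys = block_keys(tokens)
        hits = 0
        for key in keys:
            if key not in self.fast and key not in self.slow:
                break  # reuse stops at the first missing block of the prefix
            hits += 1
        for key in keys:
            self._touch(key)
        reused = hits * BLOCK
        return reused, len(tokens) - reused


cache = TieredPrefixCache(fast_blocks=2, slow_blocks=4)
system_prompt = list(range(8))  # shared prefix spanning two full blocks
first = cache.admit(system_prompt + [100, 101, 102])             # (0, 11)
second = cache.admit(system_prompt + [200, 201, 202, 203, 204])  # (8, 5)
third = cache.admit(system_prompt + [300])                       # (8, 1)
```

In this sketch the third request still reuses the shared prefix even though its first block had been demoted to the slow tier, which is the behavior a multi-tiered cache is meant to provide.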
Problem

Research questions and friction points this paper is trying to address.

LLM inference
model deployment
KV cache management
speculative decoding
multimodal processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prefill-Decode Disaggregation
KV Cache Management
Speculative Decoding
Adaptive Quantization
Multi-level Parallelism
Boyu Tan
Alibaba Group
Jiarui Guo
Peking University
Zongwei Lv
Peking University
Hanbo Sun
Tsinghua University
High Performance Computing
Tong Yang
Peking University, Beijing, China
Sketch, Network measurement, Bloom filter, IP lookup, Hash Table
Kan Liu
Alibaba Group
Xinfei Shi
Alibaba Group
Zetao Hu
Alibaba Group
Yaxin Yu
Alibaba Group
Chi Zhang
Alibaba Group
Jianning Zhang
Alibaba Group
Xi Yang
Alibaba Group
Wei Zhang
Alibaba Inc, Past Amazon.com, Microsoft, University of Illinois at Chicago, Xian Jiaotong University
Cloud computing, AutoML, Recommendation, Targeted ads
Bo Cai
Alibaba Group
Silu Zhou
Alibaba Group
Xiyu Wang
Alibaba Group
Na He
Alibaba Group
Yinghao Yu
Engineer, Alibaba
Resource management in containerized clusters, Generation optimizations for distributed systems
Wending Bao
Alibaba Group
Guiyang Huang
Alibaba Group
Yuxing Yuan
Alibaba Group
Juncheng Yin
Alibaba Group
Nan Wang
Alibaba Group
Lin Yang
Alibaba Group
Zechao Zhang
Alibaba Group