RTP-LLM: High-Performance Alibaba LLM Inference Engine

📅 2026-05-28

🤖 AI Summary

This work addresses key bottlenecks in deploying large language models (LLMs) in industrial settings (slow model loading, high latency, and poor resource utilization) by introducing RTP-LLM, a unified inference engine. The system speeds up model loading through file-order-driven I/O and by overlapping I/O with communication. It also combines a disaggregated Prefill-Decode architecture, hierarchical multi-tiered KV cache reuse, modular speculative decoding, adaptive KV cache quantization, decoupled multimodal processing, and multi-level parallelism. Evaluated on models from 8B to 235B parameters against vLLM and SGLang, it reports 4.7–6.3× faster model loading; a 35–37% reduction in P95 time-to-first-token (TTFT) latency together with a 215% improvement in cache reuse under production traffic scheduling; throughput gains of 1.12–2.48× for speculative decoding and 1.86–2.52× for multimodal inference; and, for quantized inference, 35–40% lower batch latency with a 1.9–3.0× TTFT improvement.
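
To make the idea of file-order-driven loading concrete, here is a minimal sketch. It assumes (this summary does not say so) that the technique amounts to reading tensors in the order they are laid out on disk rather than in model-definition order. `TensorEntry`, `plan_sequential_reads`, and `count_seeks` are hypothetical names for illustration, not RTP-LLM APIs.

```python
from typing import NamedTuple


class TensorEntry(NamedTuple):
    # Hypothetical index record: where one tensor lives inside a weight file.
    name: str
    file: str
    offset: int  # byte offset of the tensor within the file
    size: int    # tensor size in bytes


def plan_sequential_reads(entries):
    """Order reads by (file, offset) so each file is scanned front to back."""
    return sorted(entries, key=lambda e: (e.file, e.offset))


def count_seeks(plan):
    """Count reads that do not start where the previous read ended."""
    seeks = 0
    prev = None
    for e in plan:
        contiguous = (
            prev is not None
            and e.file == prev.file
            and e.offset == prev.offset + prev.size
        )
        if not contiguous:
            seeks += 1
        prev = e
    return seeks


# Tensors listed in model-definition order, which differs from on-disk order.
entries = [
    TensorEntry("layer1.w", "shard0.bin", 200, 100),
    TensorEntry("embed", "shard0.bin", 0, 100),
    TensorEntry("layer0.w", "shard1.bin", 0, 50),
    TensorEntry("layer0.b", "shard0.bin", 100, 100),
]
plan = plan_sequential_reads(entries)
print(count_seeks(entries), "->", count_seeks(plan))
```

In this toy index, the reordered plan needs one positioning operation per file instead of one per tensor; how much that helps in practice depends on the storage backend.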

📝 Abstract

Large Language Models (LLMs) have revolutionized AI applications, but deploying them at scale presents significant challenges. We present RTP-LLM, a high-performance inference engine for industrial-scale LLM deployment, successfully deployed across Alibaba Group serving over 100 million users. RTP-LLM addresses fundamental bottlenecks through integrated design. It optimizes model loading via file-order-driven I/O and parallel I/O-communication overlapping. The Prefill-Decode Disaggregation architecture decouples compute-intensive prefill from memory-bound decode phases, combined with hierarchical multi-tiered KV cache management enabling efficient cache reuse. In addition, RTP-LLM incorporates modular speculative decoding supporting multiple algorithms, adaptive KV cache quantization, and decoupled multimodal processing, with support for multi-level parallelism. Comprehensive evaluations across diverse model architectures (8B-235B parameters) have been conducted, where both controlled benchmarks and real production workloads are used. The results demonstrate RTP-LLM's superior performance against vLLM and SGLang: 4.7x-6.3x model loading speedup, 35-37% TTFT P95 latency reduction with 215% cache reuse improvement in production traffic scheduling, 1.12x-2.48x and 1.86x-2.52x throughput improvements in speculative decoding and multimodal inference, respectively, and 35-40% batch latency reduction with 1.9x-3.0x TTFT improvement in quantized inference. RTP-LLM's production-proven architecture and open-source availability make it a comprehensive solution for industrial LLM deployment.
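
The hierarchical, multi-tiered KV cache reuse mentioned in the abstract can be illustrated with a small sketch. The abstract does not describe the actual tiers or eviction policy, so the following assumes two tiers (GPU and host memory) with LRU eviction and block-level prefix matching; `TieredPrefixCache` and `BLOCK` are hypothetical names, and the cache stores only keys, not real KV tensors.

```python
from collections import OrderedDict

BLOCK = 4  # tokens per cache block (illustrative value)


def block_keys(tokens):
    """Key each full block by the entire prefix up to its end.

    A key therefore matches only if every earlier token matches too.
    Real systems would use a chained hash instead of the raw prefix.
    """
    return [tuple(tokens[:end]) for end in range(BLOCK, len(tokens) + 1, BLOCK)]


class TieredPrefixCache:
    def __init__(self, gpu_blocks, host_blocks):
        self.gpu = OrderedDict()   # fast tier, least recently used first
        self.host = OrderedDict()  # slower, larger tier
        self.gpu_cap = gpu_blocks
        self.host_cap = host_blocks

    def _put_gpu(self, key):
        self.gpu[key] = True
        self.gpu.move_to_end(key)
        while len(self.gpu) > self.gpu_cap:
            old, _ = self.gpu.popitem(last=False)  # demote LRU block to host
            self.host[old] = True
            self.host.move_to_end(old)
            while len(self.host) > self.host_cap:
                self.host.popitem(last=False)      # drop LRU host block

    def insert(self, tokens):
        """Record the blocks produced by prefilling `tokens`."""
        for key in block_keys(tokens):
            self.host.pop(key, None)
            self._put_gpu(key)

    def match(self, tokens):
        """Return how many leading prompt tokens have reusable KV blocks."""
        reused = 0
        for key in block_keys(tokens):
            if key in self.gpu:
                self.gpu.move_to_end(key)
            elif key in self.host:
                del self.host[key]
                self._put_gpu(key)  # promote host hit back to the GPU tier
            else:
                break
            reused += BLOCK
        return reused


cache = TieredPrefixCache(gpu_blocks=2, host_blocks=4)
cache.insert([1, 2, 3, 4, 5, 6, 7, 8])
cache.insert([9, 10, 11, 12])  # pushes the oldest block down to the host tier
print(cache.match([1, 2, 3, 4, 5, 6, 7, 8, 99]))
```

A prompt that shares a prefix with an earlier request can skip prefill for the matched tokens even when some of its blocks were demoted to the slower tier, which is the reuse benefit the abstract attributes to the hierarchy.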

Problem

Research questions and friction points this paper is trying to address.

LLM inference

model deployment

KV cache management

speculative decoding

multimodal processing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Prefill-Decode Disaggregation

KV Cache Management

Speculative Decoding
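
As a rough illustration of the speculative decoding contribution listed above, the sketch below shows one draft-then-verify step with greedy acceptance. This is a generic textbook variant, not necessarily one of the algorithms RTP-LLM supports; `speculative_step` and the two toy models are hypothetical, and a real engine would verify all draft positions in one batched forward pass of the target model rather than calling it token by token.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """Propose k tokens with the draft model, then verify them greedily."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    draft = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)

    # Verify phase: keep draft tokens while they match the target's choice.
    accepted = []
    ctx = list(prefix)
    for token in draft:
        expected = target_next(ctx)
        if token != expected:
            accepted.append(expected)  # first mismatch: use the target's token
            return accepted
        accepted.append(token)
        ctx.append(token)

    accepted.append(target_next(ctx))  # all drafts accepted: one bonus token
    return accepted


# Toy models over digits: the target always counts up by one;
# the draft agrees except right after a 3, where it guesses 0.
def target_next(ctx):
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(speculative_step([1], draft_next, target_next, k=4))
print(speculative_step([4], draft_next, target_next, k=3))
```

Each step emits at least one token that the target model agrees with, so the output matches plain greedy decoding with the target; the speedup comes from accepting several draft tokens per target pass when the draft is accurate.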