🤖 AI Summary
This work addresses critical bottlenecks in deploying large language models (LLMs) in industrial settings (inefficient inference, slow model loading, high latency, and suboptimal resource utilization) by introducing a unified, high-efficiency inference engine. The proposed system accelerates model loading through file-order-driven sequential I/O and by overlapping I/O with communication. It also incorporates a decoupled Prefill-Decode architecture, hierarchical KV cache reuse, modular speculative decoding, adaptive KV cache quantization, decoupled multimodal processing, and multi-level parallelism. Evaluated on models ranging from 8B to 235B parameters, it outperforms vLLM and SGLang: model loading is 4.7–6.3× faster; in production traffic scheduling, P95 time-to-first-token (TTFT) latency falls by 35–37% and KV cache reuse improves by 215%; and throughput rises by 1.12–2.48× for speculative decoding and 1.86–2.52× for multimodal inference. In quantized inference, batch latency is lowered by 35–40% and TTFT improves by 1.9–3.0×.
📝 Abstract
Large Language Models (LLMs) have revolutionized AI applications, but deploying them at scale presents significant challenges. We present RTP-LLM, a high-performance inference engine for industrial-scale LLM deployment that has been deployed across Alibaba Group, serving over 100 million users. RTP-LLM addresses fundamental bottlenecks through an integrated design. It optimizes model loading via file-order-driven I/O and by overlapping I/O with communication in parallel. Its Prefill-Decode Disaggregation architecture decouples the compute-intensive prefill phase from the memory-bound decode phase and is combined with hierarchical, multi-tiered KV cache management that enables efficient cache reuse. In addition, RTP-LLM incorporates modular speculative decoding that supports multiple algorithms, adaptive KV cache quantization, and decoupled multimodal processing, with support for multi-level parallelism.
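The idea of multi-tiered KV cache reuse can be illustrated with a small sketch. This is not RTP-LLM's implementation: the class `TieredPrefixCache`, the block size `BLOCK`, the two-tier layout, and the LRU eviction policy are all assumptions chosen for illustration. Token sequences are split into fixed-size blocks, each block key is a hash chained over the whole prefix, and blocks evicted from the fast tier are demoted to a slower tier rather than dropped.

```python
from collections import OrderedDict
from hashlib import sha256

BLOCK = 4  # tokens per cache block (hypothetical value for illustration)


def block_keys(tokens):
    """Chain-hash each full block so a key identifies the entire prefix up to it."""
    keys, prev = [], b""
    full = len(tokens) - len(tokens) % BLOCK  # ignore a trailing partial block
    for i in range(0, full, BLOCK):
        chunk = ",".join(map(str, tokens[i:i + BLOCK])).encode()
        prev = sha256(prev + chunk).digest()
        keys.append(prev)
    return keys


class TieredPrefixCache:
    """Two LRU tiers: 'fast' stands in for GPU memory, 'slow' for host memory."""

    def __init__(self, fast_capacity, slow_capacity):
        self.fast = OrderedDict()
        self.slow = OrderedDict()
        self.fast_capacity = fast_capacity
        self.slow_capacity = slow_capacity

    def _put_fast(self, key, value):
        self.fast[key] = value
        self.fast.move_to_end(key)
        # Demote least recently used blocks to the slow tier instead of dropping them.
        while len(self.fast) > self.fast_capacity:
            old_key, old_value = self.fast.popitem(last=False)
            self.slow[old_key] = old_value
            self.slow.move_to_end(old_key)
            while len(self.slow) > self.slow_capacity:
                self.slow.popitem(last=False)

    def insert(self, tokens):
        """Record the KV blocks of a processed prompt (values are placeholders)."""
        for key in block_keys(tokens):
            if key in self.fast:
                self.fast.move_to_end(key)
            else:
                self._put_fast(key, self.slow.pop(key, "kv-block"))

    def match_prefix(self, tokens):
        """Return how many leading tokens already have cached KV blocks."""
        matched = 0
        for key in block_keys(tokens):
            if key in self.fast:
                self.fast.move_to_end(key)
            elif key in self.slow:
                self._put_fast(key, self.slow.pop(key))  # promote on hit
            else:
                break
            matched += BLOCK
        return matched


cache = TieredPrefixCache(fast_capacity=2, slow_capacity=4)
cache.insert([1, 2, 3, 4, 5, 6, 7, 8])
reused = cache.match_prefix([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
```

In this sketch a new request that shares an 8-token prefix with an earlier one only needs prefill for the remaining tokens; a real engine would store tensors rather than placeholder strings and would likely use more tiers and larger blocks.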
We evaluate RTP-LLM on diverse model architectures (8B to 235B parameters) using both controlled benchmarks and real production workloads. Compared with vLLM and SGLang, RTP-LLM achieves a 4.7x-6.3x model loading speedup; a 35-37% reduction in P95 TTFT latency together with a 215% improvement in cache reuse under production traffic scheduling; throughput improvements of 1.12x-2.48x in speculative decoding and 1.86x-2.52x in multimodal inference; and, in quantized inference, a 35-40% reduction in batch latency with a 1.9x-3.0x TTFT improvement. RTP-LLM's production-proven architecture and open-source availability make it a comprehensive solution for industrial LLM deployment.
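The results mix two conventions, percentage reductions and multiplicative speedups, which are easy to misread side by side. The helper below converts between them. It assumes that a "reduction" is measured relative to the baseline latency and that a "speedup" is the ratio of baseline to improved time; the abstract does not state its exact definitions, so the converted values are illustrative only.

```python
def reduction_to_speedup(reduction_pct):
    """Convert a latency reduction in percent to an equivalent speedup factor."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


def speedup_to_reduction(speedup):
    """Convert a speedup factor to the equivalent time reduction in percent."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (1.0 - 1.0 / speedup) * 100.0


# A 35-37% latency reduction corresponds to roughly a 1.54x-1.59x speedup.
ttft_speedups = [round(reduction_to_speedup(p), 2) for p in (35, 37)]

# A 4.7x-6.3x loading speedup corresponds to roughly 79-84% less loading time.
loading_reductions = [round(speedup_to_reduction(s), 1) for s in (4.7, 6.3)]
```

Under these assumptions, the reported percentage reductions and speedup factors can be compared on a single scale.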