xLLM Technical Report

📅 2025-10-16
🤖 AI Summary
To address low throughput, suboptimal resource utilization, and poor cross-accelerator portability of LLM inference frameworks in enterprise-scale deployment, this paper proposes xLLM—a high-performance inference framework featuring decoupled service and engine layers. Its core innovations include: (1) a dynamic Prefill-Decode scheduling strategy and an Encode-Prefill-Decode separation mechanism to enable elastic multimodal request orchestration and global KV cache sharing; and (2) system-algorithm co-optimizations—namely, multi-level execution pipelining, adaptive graph compilation, xTensor memory management, optimized speculative decoding, and dynamic EPLB. Experiments under identical time-per-output-token (TPOT) constraints show that xLLM achieves up to 1.7× and 2.2× higher throughput than MindIE and vLLM-Ascend on Qwen models, respectively, and delivers an average 1.7× improvement on Deepseek models. These advances significantly enhance cluster resource utilization and service availability.
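The global KV cache sharing mentioned above can be illustrated with a toy prefix index that maps hashed token-prefix blocks to the instance holding them, so a new request can reuse the longest already-computed prefix. This is only a sketch; the block size, hashing scheme, and class/method names are assumptions, not xLLM's actual design.

```python
import hashlib

class GlobalKVCacheIndex:
    """Toy prefix-hash index for sharing KV-cache blocks across instances."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}  # prefix hash -> name of instance holding the block

    def _prefix_hashes(self, token_ids):
        # One hash per full block-sized prefix of the token sequence.
        hashes = []
        for end in range(self.block_size, len(token_ids) + 1, self.block_size):
            key = hashlib.sha256(str(token_ids[:end]).encode("utf-8")).hexdigest()
            hashes.append(key)
        return hashes

    def register(self, token_ids, instance):
        # Record which instance holds the KV blocks for each prefix.
        for h in self._prefix_hashes(token_ids):
            self.blocks.setdefault(h, instance)

    def longest_cached_prefix(self, token_ids):
        """Return (num_cached_tokens, instance) for the longest reusable prefix."""
        best, holder = 0, None
        for i, h in enumerate(self._prefix_hashes(token_ids)):
            if h not in self.blocks:
                break
            best, holder = (i + 1) * self.block_size, self.blocks[h]
        return best, holder
```

A scheduler could query `longest_cached_prefix` to route a request to the instance that already holds most of its prompt's KV cache, skipping that much prefill work.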

📝 Abstract
We introduce xLLM, an intelligent and efficient Large Language Model (LLM) inference framework designed for high-performance, large-scale enterprise-grade serving, with deep optimizations for diverse AI accelerators. To address the throughput, utilization, and portability challenges of such deployments, xLLM builds a novel decoupled service-engine architecture. At the service layer, xLLM-Service features an intelligent scheduling module that efficiently processes multimodal requests and co-locates online and offline tasks through unified elastic scheduling to maximize cluster utilization. This module relies on a workload-adaptive dynamic Prefill-Decode (PD) disaggregation policy and a novel Encode-Prefill-Decode (EPD) disaggregation policy designed for multimodal inputs. Furthermore, it incorporates a distributed architecture to provide global KV Cache management and robust fault tolerance for high availability. At the engine layer, xLLM-Engine co-optimizes system and algorithm designs to fully saturate computing resources. This is achieved through comprehensive multi-layer execution pipeline optimizations, an adaptive graph mode, and xTensor memory management. xLLM-Engine further integrates algorithmic enhancements such as optimized speculative decoding and dynamic EPLB, which collectively boost throughput and inference efficiency. Extensive evaluations demonstrate that xLLM delivers significantly superior performance and resource efficiency. Under identical TPOT constraints, xLLM achieves throughput up to 1.7x that of MindIE and 2.2x that of vLLM-Ascend with Qwen-series models, while maintaining an average throughput of 1.7x that of MindIE with Deepseek-series models. The xLLM framework is publicly available at https://github.com/jd-opensource/xllm and https://github.com/jd-opensource/xllm-service.
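The workload-adaptive PD disaggregation policy described in the abstract can be sketched as a scheduler that keeps separate prefill and decode instance pools and flips an idle instance's role when one phase is starved of capacity. All names (`Instance`, `PDScheduler`, `imbalance_ratio`) and the load heuristic are illustrative assumptions, not the paper's actual policy.

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Instance:
    name: str
    role: str                       # "prefill" or "decode"
    queue: deque = field(default_factory=deque)

class PDScheduler:
    """Route requests into prefill/decode pools; rebalance roles on imbalance."""

    def __init__(self, instances, imbalance_ratio=2.0):
        self.instances = instances
        self.imbalance_ratio = imbalance_ratio

    def pool(self, role):
        return [i for i in self.instances if i.role == role]

    def load(self, role):
        return sum(len(i.queue) for i in self.pool(role))

    def submit(self, request_id):
        # New requests always start in the prefill phase.
        self.rebalance()
        target = min(self.pool("prefill"), key=lambda i: len(i.queue))
        target.queue.append(request_id)
        return target.name

    def rebalance(self):
        p, d = self.load("prefill"), self.load("decode")
        # Flip the least-loaded instance of the other pool when one
        # phase's backlog dominates, keeping at least one instance per role.
        if p > self.imbalance_ratio * max(d, 1) and len(self.pool("decode")) > 1:
            min(self.pool("decode"), key=lambda i: len(i.queue)).role = "prefill"
        elif d > self.imbalance_ratio * max(p, 1) and len(self.pool("prefill")) > 1:
            min(self.pool("prefill"), key=lambda i: len(i.queue)).role = "decode"
```

The point of the sketch is the dynamic part: instead of a fixed prefill/decode split, capacity follows the observed phase mix of the workload.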
Problem

Research questions and friction points this paper is trying to address.

Optimizing LLM inference for high-performance enterprise-grade serving
Enhancing cluster utilization through intelligent scheduling of multimodal requests
Boosting throughput and efficiency via system-algorithm co-optimization techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled service-engine architecture for LLM inference
Intelligent scheduling with dynamic disaggregation policies
Multi-layer pipeline optimizations and memory management
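Among the engine-side algorithmic enhancements, speculative decoding has a simple core: a small draft model proposes several tokens, the target model verifies them in one pass, and the longest agreeing prefix is kept plus one "bonus" token from the target. The function below sketches the greedy (deterministic) variant of that acceptance rule; it is a generic illustration, not xLLM's optimized implementation.

```python
def verify_draft(draft_tokens, target_greedy):
    """Greedy speculative-decoding acceptance.

    draft_tokens:  k tokens proposed by the draft model.
    target_greedy: k+1 tokens the target model would emit greedily at
                   each of those positions, plus one position past the draft.
    Returns the accepted tokens plus the target's bonus token.
    """
    accepted = []
    for draft_tok, target_tok in zip(draft_tokens, target_greedy):
        if draft_tok != target_tok:
            break                      # first disagreement ends acceptance
        accepted.append(draft_tok)
    # The target model always contributes at least one token per step,
    # so even a fully rejected draft makes forward progress.
    bonus = target_greedy[len(accepted)]
    return accepted + [bonus]
```

Because verification scores all k draft positions in a single target-model forward pass, each step emits between 1 and k+1 tokens for roughly the cost of one.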
Authors

Tongxuan Liu · University of Science and Technology of China · LLM Logic Reasoning, Multi-Agents, LLM Inference System, LVLM, Recommender System
Tao Peng · Jilin University · natural language processing, knowledge graph
Peijun Yang · JD.com
Xiaoyang Zhao · JD.com
Xiusheng Lu · USTC
Weizhe Huang · JD.com
Zirui Liu · Peking University · Systems, Algorithms, Data Structures
Xiaoyu Chen · JD.com
Zhiwei Liang · JD.com
Jun Xiong · JD.com
Donghe Jin · JD.com
Minchao Zhang · JD.com
Jinrong Guo · JD.com
Yingxu Deng · JD.com
Xu Zhang · JD.com
Xianzhe Dong · BUAA
Siqi Wang · BUAA
Siyu Wu · BUAA
Yu Wu · University of Cambridge · machine learning, health sensing, mobile health
Zihan Tang · USTC
Yuting Zeng · PKU
Yanshu Wang · Tsinghua University · Computer & Economy
Jinguang Liu · JD.com
Meng Kang · JD.com
Menxin Li · JD.com