Serving Large Language Models on Huawei CloudMatrix384

πŸ“… 2025-06-15
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address severe challenges in compute capacity, memory bandwidth, and communication latency for large-model inference serving, this paper introduces CloudMatrix384, an end-to-end AI infrastructure supernode, together with CloudMatrix-Infer, its dedicated inference system. Methodologically, it proposes a peer-to-peer serving architecture that enables independent, elastic scaling across the prefill, decode, and KV-cache paths; an expert-parallelism strategy at the EP320 level; and a hardware-aware optimization framework built around the Unified Bus (UB) interconnect. Running on a heterogeneous cluster of Ascend 910C NPUs and Kunpeng CPUs, the system combines full-mesh UB interconnection, INT8 quantization, microbatch-based pipelined scheduling, and custom operators. Evaluation on DeepSeek-R1 shows 6,688 tokens/s per NPU for prefill and 1,943 tokens/s per NPU for decode (TPOT < 50 ms); under a stringent 15 ms latency constraint it sustains 538 tokens/s, while INT8 quantization preserves model accuracy across benchmarks.
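
The peer-to-peer serving architecture described above decouples prefill, decode, and KV-cache into independently scalable tiers. Below is a minimal, illustrative Python sketch of that control flow; all class and function names are hypothetical assumptions, and the real system moves KV blocks over the UB fabric rather than through an in-process dictionary.

```python
# Hedged sketch (hypothetical names) of disaggregated, peer-to-peer serving:
# prefill and decode tiers meet only through a pooled KV-cache store, so each
# tier can be scaled or placed independently.

from dataclasses import dataclass, field


@dataclass
class KVCachePool:
    """Stand-in for the UB-attached, pooled KV-cache store."""
    blocks: dict = field(default_factory=dict)

    def put(self, request_id: str, kv_blocks: list) -> None:
        self.blocks[request_id] = kv_blocks

    def get(self, request_id: str) -> list:
        return self.blocks.pop(request_id)


class PrefillWorker:
    def run(self, request_id: str, prompt_tokens: list, pool: KVCachePool) -> int:
        # Compute attention KV for the whole prompt, then publish it to the pool.
        kv_blocks = [("kv", tok) for tok in prompt_tokens]  # placeholder compute
        pool.put(request_id, kv_blocks)
        return prompt_tokens[-1]  # seed token handed to the decode tier


class DecodeWorker:
    def run(self, request_id: str, seed: int, pool: KVCachePool, steps: int) -> list:
        kv_blocks = pool.get(request_id)  # fetch KV over the (simulated) fabric
        out, tok = [], seed
        for _ in range(steps):
            tok = tok + 1  # placeholder for one autoregressive decode step
            kv_blocks.append(("kv", tok))
            out.append(tok)
        return out


pool = KVCachePool()
seed = PrefillWorker().run("req-1", [101, 102, 103], pool)
print(DecodeWorker().run("req-1", seed, pool, steps=4))
```

Because the two worker tiers interact only through the shared pool, either can be scaled out without repartitioning the other, which is what makes the disaggregation elastic.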

πŸ“ Abstract
The rapid evolution of large language models (LLMs), driven by growing parameter scales, adoption of mixture-of-experts (MoE) architectures, and expanding context lengths, imposes unprecedented demands on AI infrastructure. Traditional AI clusters face limitations in compute intensity, memory bandwidth, inter-chip communication, and latency, compounded by variable workloads and strict service-level objectives. Addressing these issues requires fundamentally redesigned hardware-software integration. This paper introduces Huawei CloudMatrix, a next-generation AI datacenter architecture, realized in the production-grade CloudMatrix384 supernode. It integrates 384 Ascend 910C NPUs and 192 Kunpeng CPUs interconnected via an ultra-high-bandwidth Unified Bus (UB) network, enabling direct all-to-all communication and dynamic pooling of resources. These features optimize performance for communication-intensive operations, such as large-scale MoE expert parallelism and distributed key-value cache access. To fully leverage CloudMatrix384, we propose CloudMatrix-Infer, an advanced LLM serving solution incorporating three core innovations: a peer-to-peer serving architecture that independently scales prefill, decode, and caching; a large-scale expert parallelism strategy supporting EP320 via efficient UB-based token dispatch; and hardware-aware optimizations including specialized operators, microbatch-based pipelining, and INT8 quantization. Evaluation with the DeepSeek-R1 model shows CloudMatrix-Infer achieves state-of-the-art efficiency: prefill throughput of 6,688 tokens/s per NPU and decode throughput of 1,943 tokens/s per NPU (<50 ms TPOT). It effectively balances throughput and latency, sustaining 538 tokens/s even under stringent 15 ms latency constraints, while INT8 quantization maintains model accuracy across benchmarks.
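
To make the EP320 token-dispatch step from the abstract concrete, here is a hedged sketch of the generic MoE routing it relies on: each token's gate selects its top-k experts, and tokens are bucketed per destination expert before an all-to-all exchange. The constants and function names are illustrative assumptions, not the paper's implementation; in CloudMatrix-Infer the exchange runs over the UB network.

```python
# Illustrative MoE token dispatch for large-scale expert parallelism.
# NUM_EXPERTS and TOP_K are assumptions for the sketch, not paper values.

import numpy as np

NUM_EXPERTS = 320  # EP320: roughly one expert instance per NPU rank
TOP_K = 8          # routed experts per token (assumed)


def dispatch(tokens: np.ndarray, gate_logits: np.ndarray):
    """Group token indices by destination expert.

    tokens:      (n, d) token activations
    gate_logits: (n, NUM_EXPERTS) router scores
    returns:     dict expert_id -> (token_indices, activations)
    """
    topk = np.argsort(gate_logits, axis=1)[:, -TOP_K:]  # (n, TOP_K) expert ids
    buckets: dict[int, list[int]] = {}
    for tok_idx, experts in enumerate(topk):
        for e in experts:
            buckets.setdefault(int(e), []).append(tok_idx)
    # In the real system each bucket becomes an all-to-all send over the UB
    # network; here we just return the per-expert slices.
    return {e: (idx, tokens[idx]) for e, idx in buckets.items()}


rng = np.random.default_rng(0)
out = dispatch(rng.normal(size=(16, 64)), rng.normal(size=(16, NUM_EXPERTS)))
print("experts hit:", len(out))
```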
Problem

Research questions and friction points this paper is trying to address.

Traditional AI clusters fall short on compute intensity, memory bandwidth, inter-chip communication, and latency for large MoE models
Communication-intensive operations, such as large-scale expert parallelism and distributed KV-cache access, bottleneck LLM serving
Serving systems must balance throughput against strict latency SLOs under variable workloads
Innovation

Methods, ideas, or system contributions that make the work stand out.

Peer-to-peer serving architecture scales prefill, decode, and caching independently
Large-scale expert parallelism supports EP320 via UB-based token dispatch
Hardware-aware optimizations include INT8 quantization and microbatch pipelining (sketched below)
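
As a concrete illustration of the INT8 bullet above, the sketch below shows generic per-channel symmetric weight quantization. The paper's actual INT8 pipeline is not specified here and may differ (e.g., in activation scaling or outlier handling); the function names are hypothetical.

```python
# Generic per-channel symmetric INT8 weight quantization sketch (assumed
# scheme, not the paper's exact method): w ~= q * scale, with one scale
# per output channel chosen so the channel's max magnitude maps to 127.

import numpy as np


def quantize_int8(w: np.ndarray):
    """Quantize each output channel (row) to int8 with its own scale."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # (out, 1)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float weight matrix."""
    return q.astype(np.float32) * scale


w = np.random.default_rng(1).normal(size=(4, 8)).astype(np.float32)
q, s = quantize_int8(w)
print("max abs quantization error:", np.abs(w - dequantize_int8(q, s)).max())
```

Per-channel scaling keeps the quantization error proportional to each channel's dynamic range, which is one common way INT8 schemes preserve accuracy.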
Pengfei Zuo
Huawei
AI Infrastructure, Cloud Infrastructure, Machine Learning Systems, Memory Systems, Storage Systems
Huimin Lin
Huawei
Junbo Deng
Huawei
Nan Zou
Huawei
Xingkun Yang
Huawei
Yingyu Diao
Huawei
Weifeng Gao
Huawei
Ke Xu
Huawei
Zhangyu Chen
Huawei
Shirui Lu
Huawei
Zhao Qiu
Huawei
Peiyang Li
Huawei
Xianyu Chang
Huawei
Zhengzhong Yu
Huawei
Fangzheng Miao
Huawei
Jia Zheng
Huawei
Ying Li
Huawei
Yuan Feng
Huawei
Bei Wang
Huawei
Zaijian Zong
Huawei
Mosong Zhou
Huawei
Wenli Zhou
SiliconFlow
Houjiang Chen
SiliconFlow
Xingyu Liao
SiliconFlow
Yipeng Li
SiliconFlow
Wenxiao Zhang
SiliconFlow
Ping Zhu
SiliconFlow
Yinggang Wang
SiliconFlow
Chuanjie Xiao
SiliconFlow
Depeng Liang
SiliconFlow
Dong Cao
SiliconFlow
Juncheng Liu
SiliconFlow
Yongqiang Yang
Huawei Cloud
Cloud Networking, Distributed Systems
Xiaolong Bai
Huawei
Yi Li
Huawei
Huaguo Xie
Huawei
Huatao Wu
Huawei
Zhibin Yu
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Computer Architecture, Computer Systems
Lv Chen
Huawei
Hu Liu
Huawei
Yujun Ding
Huawei
Haipei Zhu
Huawei
Jing Xia
Huawei
Yi Xiong
Huawei
Zhou Yu
Huawei
Heng Liao
Huawei