Taiji: A DPU Memory Elasticity Solution for In-production Cloud Environments

📅 2025-11-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address long hardware upgrade cycles and limited memory on Data Processing Units (DPUs) in cloud data centers, this paper proposes Taiji, a production-ready memory-elasticity architecture. Taiji switches the DPU operating system into a guest OS running on a lightweight virtualization layer and combines hybrid virtualization with parallel memory swapping. This makes nearly all DPU memory swappable and enables memory overcommitment, hot-switch, and hot-upgrade without disrupting upper-layer applications. Experiments show an effective memory capacity increase of over 50%, virtualization overhead of about 5%, and 90% of swap-in operations completing within 10 μs. Taiji is deployed on more than 30,000 servers in large-scale production environments.

📝 Abstract
The growth of cloud computing drives data centers toward higher density and efficiency. Data processing units (DPUs) enhance server network and storage performance but face challenges such as long hardware upgrade cycles and limited resources. To address these, we propose Taiji, a resource-elasticity architecture for DPUs. Combining hybrid virtualization with parallel memory swapping, Taiji switches the DPU's operating system (OS) into a guest OS and inserts a lightweight virtualization layer, making nearly all DPU memory swappable. It achieves memory overcommitment for the switched guest OS via high-performance memory elasticity, fully transparent to upper-layer applications, and supports hot-switch and hot-upgrade to meet in-production cloud requirements. Experiments show that Taiji expands DPU memory resources by over 50%, maintains virtualization overhead around 5%, and ensures 90% of swap-ins complete within 10 microseconds. Taiji delivers an efficient, reliable, low-overhead elasticity solution for DPUs and is deployed in large-scale production systems across more than 30,000 servers.
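
The headline figures in the abstract (over 50% more memory, 90% of swap-ins within 10 microseconds) can be made concrete with a little arithmetic. The following is a minimal sketch, not the authors' evaluation code: the function names, the 32 GiB DPU size, and the latency samples are all hypothetical and chosen only for illustration.

```python
def effective_capacity_gib(physical_gib: float, expansion: float) -> float:
    """Memory the guest OS can use when `expansion` (0.5 means 50%) is overcommitted."""
    return physical_gib * (1.0 + expansion)


def percentile(samples, p: int):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    rank = -(-p * len(ordered) // 100)  # integer ceiling of p * n / 100
    return ordered[rank - 1]


def meets_swap_in_target(latencies_us, p: int = 90, budget_us: float = 10.0) -> bool:
    """True if the p-th percentile swap-in latency is within the budget."""
    return percentile(latencies_us, p) <= budget_us


if __name__ == "__main__":
    # Hypothetical DPU with 32 GiB of physical memory and 50% expansion.
    print(effective_capacity_gib(32, 0.5))
    # Hypothetical swap-in latencies in microseconds, with one slow outlier.
    print(meets_swap_in_target([2, 3, 4, 5, 6, 7, 8, 9, 9.5, 40]))
```

A 90th-percentile target tolerates a slow tail: in the sample above, one 40-microsecond swap-in does not break the target because nine of the ten samples finish within 10 microseconds.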
Problem

Research questions and friction points this paper is trying to address.

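DPU memory is limited and hardware upgrade cycles are long, so capacity has to be extended through virtualization-based memory overcommitment
Memory elasticity must be high-performance and fully transparent to upper-layer applications
Hot-switch and hot-upgrade must be supported to meet in-production cloud requirements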
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid virtualization enables DPU memory elasticity
Parallel memory swapping achieves transparent overcommitment
Hot-switch capability supports in-production cloud upgrades
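
This summary does not describe how Taiji implements parallel memory swapping, so the following is only a toy model of the general idea: a guest sees more pages than fit in resident memory, cold pages are evicted to a backing store, and several missing pages are swapped in concurrently. The class `ToySwappableMemory`, its methods, the LRU eviction policy, and the thread pool are all assumptions made for illustration, not the paper's design.

```python
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor


class ToySwappableMemory:
    """Toy overcommitted memory: only `resident_limit` pages stay resident."""

    def __init__(self, resident_limit: int, workers: int = 4):
        self.resident_limit = resident_limit
        self.resident = OrderedDict()  # page_id -> bytes, oldest first (LRU)
        self.backing = {}              # page_id -> bytes, swapped-out pages
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def write(self, page_id: int, data: bytes) -> None:
        """Store a page as resident, then evict cold pages if over the limit."""
        self.backing.pop(page_id, None)
        self.resident[page_id] = data
        self.resident.move_to_end(page_id)
        self._evict()

    def _evict(self) -> None:
        # Swap out least recently used pages until the resident set fits.
        while len(self.resident) > self.resident_limit:
            victim, data = self.resident.popitem(last=False)
            self.backing[victim] = data

    def _load(self, page_id: int):
        # Stand-in for reading one page from a swap backend.
        return page_id, self.backing[page_id]

    def read_many(self, page_ids):
        """Return the requested pages, swapping in missing ones in parallel."""
        wanted = list(dict.fromkeys(page_ids))  # de-duplicate, keep order
        missing = [p for p in wanted if p not in self.resident]
        # Workers only read `backing`; all mutation happens after map() ends.
        fetched = dict(self.pool.map(self._load, missing))
        result = {p: fetched[p] if p in fetched else self.resident[p]
                  for p in wanted}
        for page_id, data in fetched.items():
            del self.backing[page_id]
            self.resident[page_id] = data
        for page_id in wanted:
            self.resident.move_to_end(page_id)
        self._evict()
        return result

    def close(self) -> None:
        self.pool.shutdown()


if __name__ == "__main__":
    mem = ToySwappableMemory(resident_limit=2)
    for i in range(4):
        mem.write(i, f"page-{i}".encode())
    print(sorted(mem.backing))      # pages that were swapped out
    print(mem.read_many([0, 1, 3])) # two of these are swapped in concurrently
    mem.close()
```

The model keeps the total page count constant while the resident set stays bounded, which is the essence of overcommitment. A real system would additionally need fault handling, a fast backing device, and synchronization with concurrent writers.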
Authors

Hao Zheng: Alibaba Cloud, Hangzhou, China
Longxiang Wang: PhD student, City University of Hong Kong (Large language model, Encrypted database)
Yun Xu: School of Computer Science, University of Science and Technology of China (Parallel Computing, Bioinformatic Algorithms)
Qiang Wang: School of Computer Science, Xi’an Jiaotong University, Xi’an, China
Yibin Shen: Alibaba Cloud, Hangzhou, China
Xiaoshe Dong: School of Computer Science, Xi’an Jiaotong University, Xi’an, China
Bang Di: Alibaba Cloud, Hangzhou, China
Jia Wei: School of Computer Science, Xi’an Jiaotong University, Xi’an, China
Shenyu Dong: Alibaba Cloud, Hangzhou, China
Xingjun Zhang: School of Computer Science, Xi’an Jiaotong University, Xi’an, China
Weichen Chen: Alibaba Cloud, Hangzhou, China
Zhao Han: Assistant Professor, University of South Florida (Human-Robot Interaction, Augmented Reality, Robot Explainability, Robotics, Artificial Intelligence)
Sanqian Zhao: Alibaba Cloud, Hangzhou, China
Dongdong Huang: Alibaba Cloud, Hangzhou, China
Jie Qi: MIT Media Lab
Yifang Yang: Alibaba Cloud, Hangzhou, China
Zhao Gao: Alibaba Cloud, Hangzhou, China
Yi Wang: Alibaba Cloud, Hangzhou, China
Jinhu Li: Alibaba Cloud, Hangzhou, China
Xudong Ren: Alibaba Cloud, Hangzhou, China
Min He: Alibaba Cloud, Hangzhou, China
Hang Yang: Alibaba Cloud, Hangzhou, China
Xiao Zheng: Alibaba Cloud, Hangzhou, China
Haijiao Hao: Alibaba Cloud, Hangzhou, China
Jiesheng Wu: Alibaba Cloud, Hangzhou, China