Vmem: A Lightweight Hot-Upgradable Memory Management for In-production Cloud Environment

📅 2025-11-13
🤖 AI Summary
Traditional memory management in cloud environments suffers from high metadata overhead, architectural complexity, and poor stability; existing software- and hardware-based optimizations struggle to simultaneously achieve flexibility and low overhead. This paper proposes Vmem, a lightweight, online-upgradable memory management architecture. Vmem is the first production-ready solution enabling hot upgrades of the memory subsystem. It integrates lightweight reserved-memory management, VFIO-accelerated virtual machines, DPU-assisted offloading, and a dynamic upgrade mechanism. Experiments show that Vmem increases sellable memory ratio by ~2%, accelerates VFIO VM startup by over 3×, and improves VM network performance by ~10% under DPU acceleration. Deployed at scale across more than 300,000 cloud servers, Vmem robustly supports elastic scaling and rapid iteration requirements.

📝 Abstract
Traditional memory management suffers from metadata overhead, architectural complexity, and stability degradation, problems intensified in cloud environments. Existing software/hardware optimizations are insufficient for cloud computing's dual demands of flexibility and low overhead. This paper presents Vmem, a memory management architecture for in-production cloud environments that enables flexible, efficient cloud server memory utilization through lightweight reserved memory management. Vmem is the first such architecture to support online upgrades, meeting cloud requirements for high stability and rapid iterative evolution. Experiments show Vmem increases sellable memory rate by about 2%, delivers extreme elasticity and performance, achieves over 3x faster boot time for VFIO-based virtual machines (VMs), and improves network performance by about 10% for DPU-accelerated VMs. Vmem has been deployed at large scale for seven years, demonstrating efficiency and stability on over 300,000 cloud servers supporting hundreds of millions of VMs.
Problem

Research questions and friction points this paper is trying to address.

Reducing memory metadata overhead in cloud environments
Enabling online upgrades for memory management systems
Improving cloud server performance and elasticity
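
To give the first point above a sense of scale, here is a rough back-of-the-envelope sketch. It assumes one 64-byte descriptor per 4 KiB page, which is typical of Linux's `struct page` on x86-64; the 512 GiB host size, the 2 MiB comparison granularity, and the function name are illustrative assumptions, not figures or mechanisms taken from the paper.

```python
GiB = 1 << 30

def metadata_overhead(total_bytes, unit_bytes=4096, descriptor_bytes=64):
    """Return (metadata bytes, fraction of total memory) for per-unit descriptors.

    Assumption: one fixed-size descriptor per managed unit of memory.
    """
    units = total_bytes // unit_bytes
    meta = units * descriptor_bytes
    return meta, meta / total_bytes

# Hypothetical 512 GiB host tracked at 4 KiB granularity.
meta_4k, frac_4k = metadata_overhead(512 * GiB)

# Same host tracked at a coarser, hypothetical 2 MiB granularity.
meta_2m, frac_2m = metadata_overhead(512 * GiB, unit_bytes=2 << 20)

print(f"4 KiB units: {meta_4k / GiB:.2f} GiB of metadata ({frac_4k:.4%})")
print(f"2 MiB units: {meta_2m / GiB:.4f} GiB of metadata ({frac_2m:.4%})")
```

Under these assumptions, per-page descriptors alone take about 1.6% of host memory, the same order of magnitude as the roughly 2% sellable-memory gain reported in the abstract. The sketch does not claim that this is how Vmem obtains its savings.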
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight reserved memory management for cloud servers
First memory architecture supporting online upgrades
Improves VM boot time and network performance
Hao Zheng
Alibaba Cloud, Hangzhou, China
Qiang Wang
School of Computer Science, Xi’an Jiaotong University, Xi’an, China
Longxiang Wang
PhD student, City University of Hong Kong
Large language model, Encrypted database
Xishi Qiu
Alibaba Cloud, Hangzhou, China
Yibin Shen
Alibaba Cloud, Hangzhou, China
Xiaoshe Dong
School of Computer Science, Xi’an Jiaotong University, Xi’an, China
Naixuan Guan
Alibaba Cloud, Hangzhou, China
Jia Wei
School of Computer Science, Xi’an Jiaotong University, Xi’an, China
Fudong Qiu
Alibaba Cloud, Hangzhou, China
Xingjun Zhang
School of Computer Science, Xi’an Jiaotong University, Xi’an, China
Yun Xu
School of Computer Science, University of Science and Technology of China
Parallel Computing, Bioinformatic Algorithms
Mao Zhao
School of Computer Science, Xi’an Jiaotong University, Xi’an, China
Yisheng Xie
Alibaba Cloud, Hangzhou, China
Shenglong Zhao
Alibaba Cloud, Hangzhou, China
Min He
Alibaba Cloud, Hangzhou, China
Yu Li
Alibaba Cloud, Hangzhou, China
Xiao Zheng
Alibaba Cloud, Hangzhou, China
Ben Luo
Alibaba Cloud, Hangzhou, China
Jiesheng Wu
Alibaba Cloud, Hangzhou, China