🤖 AI Summary
In NUMA systems, poor coordination between the CPU scheduler and the memory manager leads to mismatched thread and page-table placement. This paper proposes Phoenix, an integrated CPU scheduler and memory manager that addresses this issue. The approach introduces: (1) the first joint thread and page-table placement mechanism; (2) differentiated migration and on-demand replication policies for data pages versus page-table pages; and (3) memory bandwidth throttling driven by hardware performance counter feedback to maintain QoS and suppress cross-socket coherence overhead. Implemented as a loadable Linux kernel module, Phoenix requires no application modifications. Evaluation on real hardware demonstrates that, compared to state-of-the-art approaches, Phoenix reduces CPU cycles by 2.09× and page-walk cycles by 1.58×, significantly improving NUMA locality and scalability.
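The differentiated policy in point (2) can be illustrated with a minimal sketch. All names here (`page_kind`, `choose_action`, the thresholds) are hypothetical and not Phoenix's actual kernel interface; the sketch only captures the stated idea that remotely accessed data pages are migrated, while page-table pages walked from multiple nodes are replicated on demand rather than moved:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-page classification; illustrative only. */
enum page_kind { DATA_PAGE, PAGE_TABLE_PAGE };

enum placement_action { LEAVE, MIGRATE, REPLICATE };

/* Sketch of a differentiated placement policy: a data page has a
 * single writable copy, so it follows the accessing threads; a
 * page-table page is read-mostly during walks, so when threads on
 * several nodes walk it, a per-node replica removes remote walks
 * instead of just moving the bottleneck to another socket. */
enum placement_action choose_action(enum page_kind kind,
                                    int accessing_nodes,
                                    bool remote_hot)
{
    if (!remote_hot)
        return LEAVE;           /* accesses are mostly local already */
    if (kind == DATA_PAGE)
        return MIGRATE;         /* move the single copy to the hot node */
    return (accessing_nodes > 1) ? REPLICATE : MIGRATE;
}
```

The key design point the summary makes is that migration and replication are not interchangeable: replicating data pages would create coherence problems for writes, while migrating a shared page table merely relocates the remote-walk penalty.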
📝 Abstract
The emergence of symmetric multi-processing (SMP) systems with non-uniform memory access (NUMA) has prompted extensive research on process and data placement to mitigate the performance impact of NUMA on applications. However, existing solutions often overlook coordination between the CPU scheduler and the memory manager, leading to inefficient thread and page-table placement. Moreover, replication techniques employed to improve locality suffer from redundant replicas, scalability barriers, and performance degradation due to memory bandwidth contention and inter-socket interference. In this paper, we present Phoenix, a novel integrated CPU scheduler and memory manager with an on-demand page-table replication mechanism. Phoenix integrates the CPU scheduling and memory management subsystems, enabling coordinated thread and page-table placement. By differentiating between data pages and page-table pages, Phoenix migrates or replicates page tables directly based on application behavior. Additionally, Phoenix employs a memory bandwidth management mechanism to maintain Quality of Service (QoS) while mitigating coherence maintenance overhead. We implemented Phoenix as a loadable kernel module for Linux, ensuring compatibility with legacy applications and ease of deployment. Our evaluation on real hardware demonstrates that Phoenix reduces CPU cycles by 2.09× and page-walk cycles by 1.58× compared to state-of-the-art solutions.
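The bandwidth management the abstract mentions can be sketched as a simple feedback controller: sample cross-socket traffic (in practice from uncore performance counters) and adjust a throttle level against a QoS budget. The function name, the 10-level scale, and the 90% dead-band are assumptions for illustration, not Phoenix's actual mechanism:

```c
#include <assert.h>

/* Illustrative feedback step: tighten the throttle (fewer outstanding
 * remote requests) when measured cross-socket bandwidth exceeds the
 * budget, relax it when there is clear headroom, and hold steady in a
 * small dead-band to avoid oscillation. Constants are assumptions. */
static int update_throttle(int level, long measured_mbps, long budget_mbps)
{
    if (measured_mbps > budget_mbps && level > 1)
        return level - 1;                       /* over budget: tighten */
    if (measured_mbps < budget_mbps * 9 / 10 && level < 10)
        return level + 1;                       /* headroom: relax */
    return level;                               /* dead-band: hold */
}
```

Running such a step periodically keeps replica-maintenance and migration traffic from starving application memory accesses, which is the QoS goal the abstract describes.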