🤖 AI Summary
This work addresses the lack of lightweight, portable operating system–level programming abstractions for near-data processing (NDP) accelerators, which hinders efficient utilization of disaggregated memory systems such as CXL. The authors present the first adaptation of Unix-like processes and pipeline abstractions to resource-constrained NDP hardware, introducing a programming model based on lightweight virtual processors and share-nothing buffer-based inter-process communication (IPC). By integrating compile-time optimizations with a customized interconnect protocol, the approach enables low-latency CPU–accelerator communication. Experimental evaluation on real hardware demonstrates significant performance improvements over CPU-only baselines across diverse workloads—including bulk memory operations, in-memory databases, and graph analytics—effectively overcoming the performance bottlenecks of conventional IPC mechanisms in disaggregated memory environments.
📝 Abstract
The use of disaggregated or far memory systems such as CXL memory pools has renewed interest in Near-Data Processing (NDP): situating cores close to memory to reduce bandwidth requirements to and from the CPU. Hardware designs for such accelerators are appearing, but clean, portable OS abstractions for programming them are lacking.
We propose a programming model for NDP devices based on familiar OS abstractions: virtual processors (processes) and inter-process communication channels (like Unix pipes).
While appealing from a user perspective, a naive implementation of such abstractions is inappropriate for NDP accelerators: the paucity of processing power in some hardware designs makes classical processes overly heavyweight, and IPC based on shared buffers makes no sense in a system designed to reduce memory bandwidth.
Accordingly, we show how to implement these abstractions in a lightweight and efficient manner by exploiting compilation and interconnect protocols. We demonstrate them on a real hardware platform running applications with a range of memory access patterns, including bulk memory operations, in-memory databases and graph applications.
Crucially, we show not only the benefits over CPU-only implementations, but also the critical importance of efficient, low-latency communication channels between CPU and NDP accelerators, a feature largely neglected in existing proposals.
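As an illustration of the process-and-pipe model described above, the following is a minimal sketch, not the authors' implementation. It assumes a Python thread as a stand-in for an NDP virtual processor and an ordinary OS pipe as a stand-in for the CPU-accelerator channel. The names `ndp_filter`, `host_consume`, and `run_pipeline` are hypothetical. The point it shows is that the near-data stage scans the data locally and sends only matching values over the channel, rather than exposing a shared buffer to the host.

```python
import os
import struct
import threading

# One unsigned 32-bit value per message (an assumed wire format).
RECORD = struct.Struct("<I")


def ndp_filter(data, threshold, write_fd):
    """Hypothetical near-data stage: scan locally, send only matches."""
    with os.fdopen(write_fd, "wb") as out:
        for value in data:
            if value > threshold:
                out.write(RECORD.pack(value))
    # Leaving the with-block closes the write end, which signals
    # end-of-stream to the reader.


def host_consume(read_fd):
    """Host-side stage: read fixed-size messages until end-of-stream."""
    results = []
    with os.fdopen(read_fd, "rb") as src:
        while True:
            chunk = src.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break
            results.append(RECORD.unpack(chunk)[0])
    return results


def run_pipeline(data, threshold):
    """Connect the two stages with a pipe, as a shell would for `a | b`."""
    read_fd, write_fd = os.pipe()
    # A thread stands in for the NDP virtual processor in this sketch.
    worker = threading.Thread(target=ndp_filter, args=(data, threshold, write_fd))
    worker.start()
    results = host_consume(read_fd)
    worker.join()
    return results
```

Calling `run_pipeline([1, 50, 3, 99, 7], 10)` moves only the two matching values across the channel. In a real NDP setting, the reduction in data crossing the interconnect is the motivation for placing the filter stage near memory.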