🤖 AI Summary
To address GPU memory capacity and bandwidth limitations, as well as the inefficient integration of heterogeneous storage media (DRAM and SSD), this paper proposes a CXL 3.0–based GPU memory expansion architecture. The method introduces three key innovations: (1) a silicon-verified, RTL-level CXL controller that enables unified management of heterogeneous memory across multiple root ports; (2) hardware-coordinated memory-semantic extensions and a low-latency interconnect design achieving sub-100 ns round-trip latency; and (3) a speculative-read and deterministic-write mechanism that effectively masks the latency variability of the backend media. Experimental evaluation shows that, compared with state-of-the-art GPU memory expansion approaches, the proposed architecture achieves significantly higher bandwidth and reduces latency by an order of magnitude. It delivers a high-bandwidth, low-latency, and scalable unified memory space, directly supporting large-model training and high-performance computing workloads.
📝 Abstract
This work introduces a CXL-based GPU memory expansion solution, featuring a novel GPU system design with multiple CXL root ports for integrating diverse storage media (DRAM and/or SSDs). We developed and fabricated in silicon a custom CXL controller, integrated at the hardware RTL level, that achieves two-digit-nanosecond round-trip latency, the first such result in the field. The design also includes speculative-read and deterministic-store mechanisms that manage read and write operations efficiently, hiding the latency variation of the endpoint's backend media. Performance evaluations show that our approach significantly outperforms existing methods, marking a substantial advance in GPU memory expansion technology.
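To make the speculative-read / deterministic-store idea concrete, here is a minimal software sketch of the concept, not the authors' RTL. All names here (`SpeculativeController`, the dict-based backend) are invented for illustration; the real controller implements this in hardware on the CXL datapath.

```python
class SpeculativeController:
    """Toy model of a memory-controller front-end hiding backend latency.

    - Deterministic store: a write is acknowledged as soon as it lands in a
      small write buffer, so the requester sees a fixed latency regardless
      of how slow the backend medium (e.g. an SSD) is; the buffer is
      drained to the backend in the background.
    - Speculative read: a read is served from the write buffer or a
      prefetch cache when possible, and the controller speculatively
      fetches a predicted next address to mask backend latency on the
      following access.
    """

    def __init__(self, backend):
        self.backend = backend      # slow medium, modeled as a dict
        self.write_buffer = {}      # pending stores, acked immediately
        self.prefetch_cache = {}    # speculatively fetched lines

    def write(self, addr, value):
        # Deterministic store: ack at buffer-insert time; backend
        # latency is invisible to the writer.
        self.write_buffer[addr] = value

    def drain(self):
        # Background flush of buffered stores to the backend medium.
        for addr, value in self.write_buffer.items():
            self.backend[addr] = value
        self.write_buffer.clear()

    def read(self, addr, next_addr=None):
        # Serve from buffer/cache first; only a miss pays backend cost.
        if addr in self.write_buffer:
            value = self.write_buffer[addr]
        elif addr in self.prefetch_cache:
            value = self.prefetch_cache.pop(addr)
        else:
            value = self.backend[addr]
        # Speculatively fetch the predicted next line.
        if next_addr is not None and next_addr in self.backend:
            self.prefetch_cache[next_addr] = self.backend[next_addr]
        return value
```

In hardware the same structure is a write-posting buffer plus a prefetcher in front of the media controller; this sketch only shows the ordering and hit/miss logic, not timing.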