AGILE: Lightweight and Efficient Asynchronous GPU-SSD Integration

📅 2025-04-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high SSD access latency and redundant CPU intervention in large-model and graph-computation workloads under GPU memory constraints, this paper proposes a lightweight asynchronous GPU-direct NVMe I/O scheme. The method eliminates CPU involvement by letting GPU kernels directly issue and manage NVMe I/O requests, integrates an HBM-based flexible software cache for fine-grained data prefetching and replacement, and deeply overlaps computation with I/O at the CUDA-stream level using low-overhead synchronization primitives that avoid deadlock. Key contributions include: (1) the first deadlock-free asynchronous GPU-NVMe I/O model; (2) a hardware-efficient software cache leveraging high-bandwidth memory; and (3) a stream-aware, lock-free coordination mechanism. Evaluation shows a 1.75× end-to-end speedup over the state-of-the-art BaM system on DLRM, with 2.85× lower NVMe request overhead, 3.12× lower cache-management cost, and 1.32× lower register usage.
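The core idea above, issue an I/O request, keep computing, then wait only when the data is actually needed, can be illustrated with a minimal host-side sketch. This is not AGILE's CUDA API; `ssd_read` and `compute` are hypothetical stand-ins, with a thread pool playing the role of the asynchronous NVMe path.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def ssd_read(block_id):
    """Stand-in for a long-latency NVMe read (hypothetical)."""
    time.sleep(0.05)
    return bytes([block_id % 256]) * 8

def compute(n):
    """Stand-in for independent work that can overlap the I/O."""
    return sum(range(n))

# Synchronous model: the thread stalls on the read before any work starts.
start = time.perf_counter()
data = ssd_read(1)
result = compute(2_000_000)
sync_time = time.perf_counter() - start

# Asynchronous model: issue the read, compute, then wait on completion.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(ssd_read, 1)   # issue request; returns immediately
    result = compute(2_000_000)          # computation overlapped with the I/O
    data = pending.result()              # explicit, late synchronization point
async_time = time.perf_counter() - start
```

Because the read and the computation proceed concurrently, the asynchronous version takes roughly max(I/O, compute) rather than their sum, which is the overlap AGILE exploits at the CUDA-stream level.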

📝 Abstract
Graphics Processing Units (GPUs) have become essential for computationally intensive applications. However, emerging workloads such as recommender systems, graph analytics, and data analytics often involve processing data that exceeds GPU on-chip memory capacity. To mitigate this issue, existing solutions enable GPUs to use CPU DRAM or SSDs as external memory. Among them, the GPU-centric approach lets GPU threads directly initiate NVMe requests, eliminating the CPU-intervention overhead of traditional methods. However, the SOTA GPU-centric approach adopts a synchronous IO model, so threads must tolerate the long communication latency before starting any tasks. In this work, we propose AGILE, a lightweight and efficient asynchronous library that allows GPU threads to access SSDs asynchronously while eliminating deadlock risks. AGILE also integrates a flexible software cache using GPU High-Bandwidth Memory (HBM). We demonstrate that asynchronous GPU-centric IO achieves up to 1.88× improvement in workloads with different computation-to-communication (CTC) ratios. We also compare AGILE with the SOTA work BaM on Deep Learning Recommendation Models (DLRM) under various settings, and the results show that AGILE achieves a 1.75× performance improvement due to its efficient design and the overlapping strategy enabled by the asynchronous IO model. We further evaluate AGILE's API overhead on graph applications, and the results demonstrate that AGILE reduces software cache overhead by up to 3.12× and NVMe IO request overhead by up to 2.85×. Compared with BaM, AGILE also consumes fewer registers, with up to a 1.32× reduction in register usage.
Problem

Research questions and friction points this paper is trying to address.

Synchronous GPU-centric IO forces GPU threads to stall on long SSD access latency
CPU-mediated data paths add redundant intervention overhead
Existing software cache and NVMe request management impose high per-request cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asynchronous, deadlock-free GPU-SSD access eliminates CPU overhead
Flexible software cache built on GPU HBM for prefetching and replacement
Stream-level overlap of computation and IO hides SSD latency
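The HBM-resident software cache can be pictured with a toy direct-mapped cache over a slow backing store. This sketch is purely illustrative: the class, its slot layout, and the `backing_read` callback are hypothetical, and AGILE's actual cache is finer-grained and GPU-side.

```python
class SoftwareCache:
    """Toy direct-mapped software cache over a slow backing store."""

    def __init__(self, num_slots, backing_read):
        self.num_slots = num_slots
        self.tags = [None] * num_slots      # which block occupies each slot
        self.data = [None] * num_slots      # cached contents (stands in for HBM)
        self.backing_read = backing_read    # NVMe read stand-in
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        slot = block_id % self.num_slots    # direct-mapped placement
        if self.tags[slot] == block_id:     # hit: serve from fast memory
            self.hits += 1
        else:                               # miss: fetch from backing store, replace
            self.misses += 1
            self.tags[slot] = block_id
            self.data[slot] = self.backing_read(block_id)
        return self.data[slot]

cache = SoftwareCache(4, backing_read=lambda b: f"block-{b}")
for b in [0, 1, 0, 5, 1, 0]:                # block 5 conflicts with 1 (5 % 4 == 1)
    cache.read(b)
print(cache.hits, cache.misses)             # → 2 4
```

Hits avoid the backing-store read entirely; the conflict between blocks 1 and 5 shows why replacement policy and associativity matter for cache-management cost, one of the overheads AGILE reports reducing by up to 3.12×.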