DCI: A Coordinated Allocation and Filling Workload-Aware Dual-Cache Allocation GNN Inference Acceleration System

📅 2025-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address inefficiencies in sampling-based GNN inference—including redundant loading of node features and adjacency matrices, inefficient cross-memory data transfers, low GPU memory utilization, and high preprocessing overhead—this paper proposes a workload-aware dual-cache coordination mechanism. Specifically, it introduces the first dynamic cache capacity allocation strategy for node features and adjacency matrices, guided by workload characteristics observed during the pre-sampling phase. Furthermore, it integrates spatial locality modeling of adjacency matrices with a lightweight padding strategy to substantially reduce preprocessing costs. Experimental results demonstrate that our approach achieves 1.18×–11.26× speedup in end-to-end inference latency over DGL, reduces preprocessing time by 52.8%–98.7%, and compresses preprocessing latency to less than 20% of that incurred by DUCATI.

📝 Abstract
Graph Neural Networks (GNNs) are powerful tools for processing graph-structured data, increasingly used for large-scale real-world graphs via sampling-based inference methods. However, inherent characteristics of neighbor sampling lead to redundant data loading during GNN inference, compounded by inefficient data transfers between host and GPU memory, resulting in slow inference and low resource utilization. Existing methods to accelerate GNN inference face several challenges: (1) low practical GPU memory utilization, (2) overlooking adjacency matrix locality, and (3) long preprocessing time. To address these challenges, we introduce DCI, an efficient workload-aware dual-cache allocation system for GNN inference acceleration. DCI allocates cache capacities for both node features and adjacency matrices based on workload patterns during the pre-sampling phase, leveraging a lightweight cache-filling algorithm to optimize data loading efficiency. Experimental results demonstrate that DCI accelerates sampling and node feature loading, achieving end-to-end inference speedups of 1.18× to 11.26× compared to DGL, and 1.14× to 13.68× over RAIN, while reducing preprocessing time by 52.8% to 98.7%. Additionally, DCI outperforms state-of-the-art single-cache inference systems by achieving speedup of 1.08× to 1.32×. We also compared DCI with DUCATI's dual-cache population strategy. Our lightweight population algorithm allows DCI to achieve nearly the same inference speed while keeping preprocessing time to less than 20% of that required by DUCATI.
Problem

Research questions and friction points this paper is trying to address.

Reduces redundant data loading in GNN inference.
Improves GPU memory utilization and data transfer efficiency.
Minimizes preprocessing time for GNN inference acceleration.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-cache allocation optimizes GNN inference efficiency.
Workload-aware cache filling reduces redundant data loading.
Lightweight algorithm minimizes preprocessing time significantly.
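The summary and abstract describe the core mechanism: pre-sample the workload, split one GPU-memory budget between a node-feature cache and an adjacency-matrix cache according to observed access patterns, then fill each cache with a lightweight greedy pass. A minimal sketch of that idea follows — all function names and the proportional-split heuristic are illustrative assumptions, not DCI's actual algorithms:

```python
from collections import Counter

def presample_workload(sample_batches):
    """Count how often each node's features and each adjacency row are
    touched during a short pre-sampling phase. `sample_batches` is a
    hypothetical iterable of (feature_nodes, adjacency_nodes) pairs."""
    feat_freq, adj_freq = Counter(), Counter()
    for feat_nodes, adj_nodes in sample_batches:
        feat_freq.update(feat_nodes)
        adj_freq.update(adj_nodes)
    return feat_freq, adj_freq

def allocate_dual_cache(feat_freq, adj_freq, budget,
                        feat_cost=1.0, adj_cost=1.0):
    """Split one memory budget between the two caches in proportion to
    observed access volume, then greedily fill each cache with its
    hottest entries (assumed per-entry costs feat_cost / adj_cost)."""
    total = (sum(feat_freq.values()) * feat_cost
             + sum(adj_freq.values()) * adj_cost)
    feat_budget = budget * (sum(feat_freq.values()) * feat_cost / total)
    adj_budget = budget - feat_budget
    feat_cache = {n for n, _ in feat_freq.most_common(int(feat_budget / feat_cost))}
    adj_cache = {n for n, _ in adj_freq.most_common(int(adj_budget / adj_cost))}
    return feat_cache, adj_cache
```

Because the split adapts to whichever structure (features or adjacency) dominates the sampled workload, neither cache is sized statically — which is the property the paper credits for its GPU-memory-utilization gains.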
Y
Yi Luo
Southwest University of Science and Technology, Mianyang, China
Y
Yaobin Wang
Southwest University of Science and Technology, Mianyang, China
Q
Qi Wang
Southwest University of Science and Technology, Mianyang, China
Y
Yingchen Song
Southwest University of Science and Technology, Mianyang, China
H
Huan Wu
Assistant Research Scientist of ESSIC, University of Maryland/NASA GSFC
Q
Qingfeng Wang
Southwest University of Science and Technology, Mianyang, China
J
Jun Huang
Southwest University of Science and Technology, Mianyang, China