Understanding and Reducing Metadata-Driven Host Overheads in Sampling-Based GNN Training

📅 2026-05-28
🤖 AI Summary
This work addresses host-device coordination overhead in sampling-based graph neural network (GNN) training, where metadata-driven dynamic execution places the CPU on the critical path. The authors propose ZEROGNN, a system that removes the host from the metadata-driven control loop by keeping runtime metadata and control logic on the GPU. By mediating dynamic behavior within a fixed kernel launch structure and provisioning a conservative yet tight execution envelope, ZEROGNN preserves execution dynamism while restoring CUDA Graph replayability, enabling a fully GPU-resident training pipeline. Experiments show that ZEROGNN achieves up to 5.28× end-to-end speedup, a near-100% GPU execution fraction, memory efficiency comparable to ideal metadata-informed allocation, and strong multi-GPU scaling.
📝 Abstract
Modern deep learning workloads increasingly exhibit dynamic, metadata-driven execution, where runtime-generated information determines memory provisioning and kernel launch decisions. In sampling-based graph neural network (GNN) training, this behavior places the CPU on the critical path, introducing persistent host-device orchestration overhead and frequent GPU-CPU synchronization, which dominate end-to-end runtime when GPU computation is small. Existing approaches, including CUDA Graphs and GPU dynamic parallelism, fail to address this problem because the metadata-driven control loop remains host-mediated, and execution structure varies across iterations. We present ZEROGNN, a system that removes the host from the metadata-driven control loop and enables fully GPU-resident execution under dynamic behavior. ZEROGNN keeps runtime metadata on-device, mediates dynamic execution within a fixed launch structure, and provisions a conservative yet tight execution envelope to restore CUDA Graph replayability. Experiments on sampling-based GNN workloads show that ZEROGNN achieves up to 5.28× end-to-end speedup, a near-100% GPU execution fraction, and memory efficiency comparable to ideal metadata-informed allocation, while enabling strong multi-GPU scaling by eliminating host-side bottlenecks.
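The contrast the abstract draws, between a host-mediated control loop and a fixed execution envelope, can be illustrated with a small CPU-only Python sketch. This is not ZEROGNN's implementation and involves no GPU or CUDA Graphs. The function names (`sample_dynamic`, `sample_into_envelope`, `masked_sum`, `envelope_capacity`), the capacity bound `num_seeds * fanout`, and the toy graph are all assumptions made for illustration. The point it tries to show is that when the valid length lives next to the buffer and consumers mask against it, buffer sizes and loop bounds no longer depend on runtime metadata, so the same "launch structure" can be reused every iteration.

```python
import random


def sample_dynamic(adj, seeds, fanout, rng):
    """Host-mediated style: the output length is only known after sampling,
    so a caller must read it back before sizing the next step."""
    out = []
    for s in seeds:
        nbrs = adj[s]
        out.extend(rng.sample(nbrs, min(fanout, len(nbrs))))
    return out


def envelope_capacity(num_seeds, fanout):
    """Conservative upper bound on sampled neighbors (assumed bound)."""
    return num_seeds * fanout


def sample_into_envelope(adj, seeds, fanout, rng, buf, count):
    """Fixed-envelope style: write into a preallocated buffer and record the
    valid length in count[0]; the buffer itself is never resized."""
    n = 0
    for s in seeds:
        nbrs = adj[s]
        for v in rng.sample(nbrs, min(fanout, len(nbrs))):
            buf[n] = v
            n += 1
    count[0] = n


def masked_sum(feat, buf, count):
    """Consumer with a fixed loop bound (the capacity); slots at or beyond
    count[0] are padding and are masked out."""
    total = 0
    for i in range(len(buf)):
        if i < count[0]:
            total += feat[buf[i]]
    return total


if __name__ == "__main__":
    # Toy graph: node -> neighbor list (hypothetical data).
    adj = {0: [1, 2, 3], 1: [0], 2: [0, 3], 3: [0, 2]}
    feat = [10, 20, 30, 40]
    seeds, fanout = [0, 1, 2], 2

    buf = [0] * envelope_capacity(len(seeds), fanout)
    count = [0]
    sample_into_envelope(adj, seeds, fanout, random.Random(7), buf, count)
    print("capacity:", len(buf), "valid:", count[0])
    print("masked sum:", masked_sum(feat, buf, count))
```

In this sketch the padding slots cost some wasted work and memory, which is why the abstract stresses an envelope that is conservative yet tight: a loose bound keeps the structure fixed but erodes the memory efficiency that an exact, metadata-informed allocation would give.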
Problem

Research questions and friction points this paper is trying to address.

metadata-driven execution
host overhead
GNN training
GPU-CPU synchronization
sampling-based
Innovation

Methods, ideas, or system contributions that make the work stand out.

metadata-driven execution
GPU-resident execution
CUDA Graph replayability
sampling-based GNN training
host overhead reduction