Rudder: Steering Prefetching in Distributed GNN Training using LLM Agents

📅 2026-02-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the inefficiency of static prefetching strategies in distributed graph neural network (GNN) training, which fail to adapt to dynamic changes in graph data distribution and sampling parameters, leading to excessive communication overhead. To overcome this limitation, the study introduces a novel approach that leverages the zero-shot in-context learning capability of large language models (LLMs) to construct an intelligent agent that dynamically prefetches remote nodes, thereby significantly reducing communication latency. Implemented within the AWS DistDGL framework, the method exploits the LLM's multi-step reasoning ability to adaptively adjust prefetching policies, surpassing the constraints of conventional heuristic or machine learning–based techniques. Experiments on the NERSC Perlmutter supercomputer show that the proposed approach improves end-to-end training performance by up to 91% over the no-prefetching baseline and by 82% over static prefetching, while reducing communication volume by more than 50%.
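The summary above describes an agent loop in which runtime statistics are fed to an LLM via a zero-shot in-context prompt and the reply steers the prefetch policy. A minimal Python sketch of that loop's glue code is below; the function names, the statistics fields, and the JSON reply schema are all hypothetical illustrations, not Rudder's actual interface, and the LLM call itself is left abstract. The key design point shown is defensive parsing: since the LLM's output is untrusted free text, a malformed reply falls back to a static policy rather than stalling training.

```python
import json

def build_prefetch_prompt(recent_stats):
    """Compose a zero-shot in-context prompt asking the agent to pick
    the next prefetch policy. `recent_stats` is a dict of runtime
    counters (hypothetical field names, e.g. remote-fetch latency,
    cache hit rate, batch size)."""
    return (
        "You control remote-neighbor prefetching for distributed GNN "
        "training.\n"
        f"Recent runtime statistics: {json.dumps(recent_stats)}\n"
        'Reply with JSON only: {"prefetch_depth": <int>, "budget": <int>}'
    )

def parse_prefetch_decision(reply, default=(1, 256)):
    """Parse the agent's JSON reply into (prefetch_depth, budget).
    Fall back to a static default policy if the reply is malformed,
    since the LLM's output is not guaranteed to be well-formed."""
    try:
        decision = json.loads(reply)
        return int(decision["prefetch_depth"]), int(decision["budget"])
    except (ValueError, KeyError, TypeError):
        return default
```

In a real deployment, `build_prefetch_prompt`'s output would be sent to the model between training steps, and the parsed `(depth, budget)` pair would parameterize which remote nodes the trainer fetches ahead of the sampler.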

📝 Abstract
Large-scale Graph Neural Networks (GNNs) are typically trained by sampling a vertex's neighbors to a fixed distance. Because large input graphs are distributed, training requires frequent irregular communication that stalls forward progress. Moreover, fetched data changes with the graph, graph distribution, sample and batch parameters, and caching policies. Consequently, any static prefetching method will miss crucial opportunities to adapt to different dynamic conditions. In this paper, we introduce Rudder, a software module embedded in the state-of-the-art AWS DistDGL framework, to autonomously prefetch remote nodes and minimize communication. Rudder's adaptation contrasts with both standard heuristics and traditional ML classifiers. We observe that the generative AI found in contemporary Large Language Models (LLMs) exhibits emergent properties like In-Context Learning (ICL) for zero-shot tasks, with logical multi-step reasoning. We find this behavior well-suited for adaptive control even with substantial undertraining. Evaluations using standard datasets and unseen configurations on the NERSC Perlmutter supercomputer show up to 91% improvement in end-to-end training performance over baseline DistDGL (no prefetching), and an 82% improvement over static prefetching, reducing communication by over 50%. Our code is available at https://github.com/aishwaryyasarkar/rudder-llm-agent.
Problem

Research questions and friction points this paper is trying to address.

Distributed GNN Training
Prefetching
Irregular Communication
Dynamic Data Access
Graph Neural Networks
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM agent
adaptive prefetching
distributed GNN training
In-Context Learning
communication optimization