🤖 AI Summary
This work addresses a limitation of existing graph Transformer implementations: they are typically confined to a single GPU, so training on large graphs is slow or runs out of memory. The authors propose a distributed training framework that automatically selects and optimizes a parallelization strategy based on the graph structure and the hardware configuration, such as bandwidth and memory capacity. Using its own implementation of distributed sparse operations, the framework accelerates sparse graph attention by up to 3.8× and reduces memory consumption by 78% compared to state-of-the-art frameworks. On large graph benchmarks it achieves up to 6× speedup when scaling up to 8 GPUs, improving the scalability and practicality of graph Transformers.
📝 Abstract
Graph foundation models have demonstrated remarkable adaptability across diverse downstream tasks through large-scale pretraining on graphs. However, existing implementations of their backbone model, the graph transformer, are typically limited to single-GPU systems, which leads to long training times or out-of-memory failures on large graphs. Moreover, parallelizing graph transformer training over the full graph is challenging, because efficiency depends heavily on both the graph structure and system characteristics such as bandwidth and memory capacity.
In this work, we introduce a distributed training framework for graph transformers that automatically selects and optimizes parallelization strategies based on the graph structure and hardware configuration. With our implementation of distributed sparse operations, we accelerate sparse graph attention by up to 3.8x and reduce memory consumption by 78% compared to state-of-the-art frameworks. On large graph benchmarks, our proposed framework achieves up to 6x speedup when scaling up to 8 GPUs. These results demonstrate that the proposed framework improves the scalability of graph transformers, bringing them closer to serving as practical graph foundation models.
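For readers unfamiliar with the operation being accelerated, the following is a minimal single-process sketch of sparse graph attention, in which each node attends only to its incoming neighbors rather than to every node. It is an illustration under stated assumptions, not the framework's implementation: the function name `sparse_graph_attention`, the edge-list input, and the scaled dot-product scoring are choices made here, and nothing in it is distributed.

```python
import math

def sparse_graph_attention(q, k, v, edges):
    """Attention restricted to graph edges.

    q, k, v: per-node feature rows (lists of lists of floats).
    edges:   list of (src, dst) pairs; dst attends to src.
    Hypothetical helper for illustration only.
    """
    n, d = len(q), len(q[0])
    # Score each edge once and group the scores by destination node.
    incoming = [[] for _ in range(n)]
    for src, dst in edges:
        score = sum(a * b for a, b in zip(q[dst], k[src])) / math.sqrt(d)
        incoming[dst].append((src, score))

    out = [[0.0] * len(v[0]) for _ in range(n)]
    for dst, nbrs in enumerate(incoming):
        if not nbrs:
            continue  # no incoming edges: the output row stays zero
        # Numerically stable softmax over this node's incoming edges only.
        m = max(s for _, s in nbrs)
        weights = [math.exp(s - m) for _, s in nbrs]
        total = sum(weights)
        # Weighted sum of the source nodes' value rows.
        for (src, _), w in zip(nbrs, weights):
            for j, x in enumerate(v[src]):
                out[dst][j] += (w / total) * x
    return out
```

The work and memory here grow with the number of edges rather than with the square of the number of nodes, which is why the graph structure matters when deciding how to split such an operation across devices.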