🤖 AI Summary
To address the low efficiency and poor scalability of knowledge graph (KG) embedding generation on web-scale KGs, this paper introduces the first GPU-accelerated RDF2vec framework. The method comprises three key innovations: (1) a GPU-parallelized random walk algorithm that significantly accelerates walk extraction, especially for long walks and dense graphs; (2) a multi-node collaborative architecture enabling distributed processing of large-scale RDF graphs; and (3) integration of PyTorch Lightning to support efficient, fault-tolerant distributed word2vec training. Experiments on both synthetic and real-world datasets demonstrate that the single-node walk generation phase substantially outperforms pyRDF2vec, SparkKGML, and jRDF2vec. Moreover, the end-to-end pipeline generates high-quality KG embeddings within practical timeframes. This work establishes a scalable, GPU-native paradigm for large-scale semantic representation learning.
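The summary describes the walk extractor only at a high level. As a rough, hypothetical sketch of what GPU-parallelized uniform random walks over a CSR-encoded graph can look like in PyTorch (the function name, signature, and sampling scheme are assumptions for illustration, not gpuRDF2vec's actual API):

```python
import torch

def gpu_random_walks(rowptr, col, start_nodes, walk_length, device="cuda"):
    """Hypothetical sketch: batched uniform random walks on a CSR graph.

    rowptr/col encode the directed RDF graph; every walk advances one hop
    per iteration in a single vectorized step, which is where a GPU
    speedup for long walks on dense graphs would come from.
    """
    rowptr = rowptr.to(device)
    col = col.to(device)
    cur = start_nodes.to(device)
    walks = [cur]
    for _ in range(walk_length):
        deg = rowptr[cur + 1] - rowptr[cur]            # out-degree per walk
        # Draw one uniform neighbor index per walk in one batched op.
        offset = (torch.rand(cur.numel(), device=device) * deg).long()
        idx = (rowptr[cur] + offset).clamp(max=col.numel() - 1)
        nxt = torch.where(deg > 0, col[idx], cur)      # dead ends self-loop
        walks.append(nxt)
        cur = nxt
    return torch.stack(walks, dim=1)                   # (num_walks, walk_length + 1)
```

Because all walks advance in lock-step through purely tensorized operations, the per-hop cost is amortized across thousands of walks at once, which is consistent with the observation that the advantage grows with walk length and graph density.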
📝 Abstract
Generating Knowledge Graph (KG) embeddings at web scale remains challenging. Among existing techniques, RDF2vec combines effectiveness with strong scalability. We present gpuRDF2vec, an open-source library that harnesses modern GPUs and supports multi-node execution to accelerate every stage of the RDF2vec pipeline. Extensive experiments on both synthetically generated graphs and real-world benchmarks show that gpuRDF2vec achieves a substantial speedup over the currently fastest alternative, jRDF2vec. In a single-node setup, our walk-extraction phase alone outperforms pyRDF2vec, SparkKGML, and jRDF2vec by a substantial margin when performing random walks on large, dense graphs, and it scales well to longer walks, which typically yield better-quality embeddings. gpuRDF2vec builds on PyTorch Lightning for its scalable word2vec implementation, enabling practitioners and researchers to train high-quality KG embeddings on large-scale graphs within practical time budgets.
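The abstract notes that the word2vec stage builds on PyTorch Lightning. A minimal, hypothetical sketch of how skip-gram with negative sampling can be packaged as a LightningModule (class name, embedding dimension, and optimizer choice are illustrative assumptions, not the library's actual code):

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class SkipGramSGNS(pl.LightningModule):
    """Hypothetical sketch: skip-gram with negative sampling as a LightningModule."""

    def __init__(self, vocab_size, dim=200, lr=1e-3):
        super().__init__()
        self.center = torch.nn.Embedding(vocab_size, dim)
        self.context = torch.nn.Embedding(vocab_size, dim)
        self.lr = lr

    def training_step(self, batch, batch_idx):
        center, pos, neg = batch                 # ids: (B,), (B,), (B, K)
        c = self.center(center)                  # (B, D)
        p = self.context(pos)                    # (B, D)
        n = self.context(neg)                    # (B, K, D)
        pos_loss = -F.logsigmoid((c * p).sum(-1))
        neg_loss = -F.logsigmoid(-(n @ c.unsqueeze(-1)).squeeze(-1)).sum(-1)
        loss = (pos_loss + neg_loss).mean()
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)
```

Wrapping the model this way lets Lightning's `Trainer(strategy="ddp", num_nodes=...)` supply the distributed, fault-tolerant training loop (checkpointing, restarts) without custom cluster code, which is plausibly what the multi-node word2vec stage relies on.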