gpuRDF2vec -- Scalable GPU-based RDF2vec

📅 2025-08-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low efficiency and poor scalability of knowledge graph (KG) embedding generation on web-scale KGs, this paper introduces the first GPU-accelerated RDF2vec framework. The method comprises three key innovations: (1) a GPU-parallelized random walk algorithm that significantly accelerates walk extraction—especially for long walks and dense graphs; (2) a multi-node collaborative architecture enabling distributed processing of large-scale RDF graphs; and (3) integration of PyTorch Lightning to support efficient, fault-tolerant distributed word2vec training. Experiments on both synthetic and real-world datasets demonstrate that the single-node walk generation phase substantially outperforms pyRDF2vec, SparkKGML, and jRDF2vec. Moreover, the end-to-end pipeline generates high-quality KG embeddings within practical timeframes. This work establishes a scalable, GPU-native paradigm for large-scale semantic representation learning.
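The first innovation, a GPU-parallelized random walk, typically works by advancing all walkers one hop per step in a single batched tensor operation. The sketch below is a hypothetical illustration of that idea (not the gpuRDF2vec implementation), assuming the graph is stored in CSR form as `offsets`/`neighbors` tensors and every node has at least one outgoing edge; moving the tensors to a CUDA device runs the same code on the GPU.

```python
import torch

def random_walks(offsets, neighbors, start_nodes, walk_len, generator=None):
    """Vectorized random walks over a CSR graph (illustrative sketch).

    offsets: (num_nodes + 1,) long tensor; node v's neighbors are
             neighbors[offsets[v]:offsets[v + 1]].
    neighbors: (num_edges,) long tensor of target node ids.
    start_nodes: (num_walkers,) long tensor of walk start nodes.
    Assumes every node has out-degree >= 1 (no dead ends).
    """
    walks = [start_nodes]
    current = start_nodes
    for _ in range(walk_len):
        starts = offsets[current]
        degrees = offsets[current + 1] - starts
        # One uniform draw per walker, floored into [0, degree - 1];
        # all walkers advance in a single batched gather.
        r = torch.rand(current.shape[0], generator=generator,
                       device=current.device)
        current = neighbors[starts + (r * degrees).long()]
        walks.append(current)
    return torch.stack(walks, dim=1)  # (num_walkers, walk_len + 1)
```

Because every step is a batched gather rather than a per-walker loop, the cost per hop is roughly constant in the number of walkers, which matches the paper's observation that GPU walk extraction pays off most for long walks and dense graphs.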

📝 Abstract
Generating Knowledge Graph (KG) embeddings at web scale remains challenging. Among existing techniques, RDF2vec combines effectiveness with strong scalability. We present gpuRDF2vec, an open-source library that harnesses modern GPUs and supports multi-node execution to accelerate every stage of the RDF2vec pipeline. Extensive experiments on both synthetically generated graphs and real-world benchmarks show that gpuRDF2vec achieves a substantial speedup over jRDF2vec, the currently fastest alternative. In a single-node setup, our walk-extraction phase alone outperforms pyRDF2vec, SparkKGML, and jRDF2vec by a substantial margin using random walks on large/dense graphs, and scales well to longer walks, which typically lead to better-quality embeddings. gpuRDF2vec builds on PyTorch Lightning for its scalable word2vec implementation, enabling practitioners and researchers to train high-quality KG embeddings on large-scale graphs within practical time budgets.
Problem

Research questions and friction points this paper is trying to address.

Accelerating RDF2vec pipeline using GPUs and multi-node execution
Improving scalability for large and dense knowledge graphs
Enabling high-quality KG embeddings within practical time budgets
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPU-accelerated RDF2vec pipeline
Multi-node execution support
PyTorch Lightning-based scalable word2vec
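The word2vec stage trains skip-gram embeddings on the extracted walk corpus. The snippet below is a minimal, hypothetical sketch of a skip-gram model with negative sampling in plain PyTorch (the paper wraps this kind of objective in PyTorch Lightning for distributed, fault-tolerant training; class and parameter names here are illustrative, not the library's API).

```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    """Skip-gram with negative sampling (illustrative sketch).

    Each entity in the walk corpus gets an input ("center") and an
    output ("context") embedding; the loss pushes true center/context
    pairs together and randomly sampled negative pairs apart.
    """
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.in_emb = nn.Embedding(vocab_size, dim)
        self.out_emb = nn.Embedding(vocab_size, dim)

    def forward(self, center, context, negatives):
        c = self.in_emb(center)        # (B, D)
        pos = self.out_emb(context)    # (B, D)
        neg = self.out_emb(negatives)  # (B, K, D)
        # Positive pairs: maximize sigmoid of the dot product.
        pos_score = torch.sigmoid((c * pos).sum(-1))
        # Negative samples: maximize sigmoid of the negated dot product.
        neg_score = torch.sigmoid(-(neg @ c.unsqueeze(-1)).squeeze(-1))
        return -(torch.log(pos_score + 1e-9).mean()
                 + torch.log(neg_score + 1e-9).mean())
```

Wrapping such a module in a Lightning `LightningModule` is what lets the training loop scale across GPUs and nodes with checkpointing for fault tolerance.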
Martin Böckling
Data and Web Science Group, University of Mannheim, Mannheim 68160, Germany
Heiko Paulheim
Professor for Data Science at University of Mannheim, Germany
Knowledge Graphs · Semantic Web · Data Mining · Machine Learning · dws@uma