NOMAD: Generating Embeddings for Massive Distributed Graphs

📅 2026-04-10
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the challenge of efficiently generating structure-preserving node embeddings for massive graphs with millions to billions of edges, which exceed the memory and processing capacity of a single machine. The authors present NOMAD, an MPI-based distributed-memory graph embedding framework that implements the proximity-based models of LINE. Practical trade-offs aimed at the irregular, distributed nature of graph embedding reduce communication overhead and improve scalability. On the CPU-based NERSC Perlmutter cluster, NOMAD achieves median speedups of 10–100× over the multi-threaded reference implementations of LINE and node2vec and 35–76× over distributed PyTorch-BigGraph (PBG), with embedding quality competitive with LINE, node2vec, and GraphVite. End-to-end speedups on real-world graphs range from 12× to 370×.
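To make the communication cost concrete, here is a minimal sketch of why partitioning a graph across ranks forces remote embedding lookups. It assumes a simple contiguous block partition of node IDs, which is an illustrative choice and not necessarily the scheme NOMAD uses. The function names `owner` and `remote_requests` are hypothetical, and the ranks are simulated in plain Python rather than with real MPI calls.

```python
def owner(node, num_nodes, num_ranks):
    """Rank that owns `node` under a contiguous block partition (assumed scheme)."""
    block = -(-num_nodes // num_ranks)  # ceiling division: nodes per rank
    return node // block


def remote_requests(edges, num_nodes, num_ranks):
    """For each rank, collect the remote nodes whose embeddings it would need.

    Assumption: an edge (u, v) is processed by the rank that owns u, so the
    embedding of v must be fetched whenever v lives on a different rank.
    """
    requests = [set() for _ in range(num_ranks)]
    for u, v in edges:
        src_rank = owner(u, num_nodes, num_ranks)
        if owner(v, num_nodes, num_ranks) != src_rank:
            requests[src_rank].add(v)
    return requests


if __name__ == "__main__":
    # A 4-node ring split across 2 simulated ranks: nodes {0, 1} and {2, 3}.
    ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
    print(remote_requests(ring, num_nodes=4, num_ranks=2))  # [{2}, {0}]
```

In this toy ring, two of the four edges cross the partition boundary, so each rank needs one remote embedding. On irregular real-world graphs the cut edges are far less evenly spread, which is the kind of overhead a distributed embedding framework has to manage.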

📝 Abstract
Successful machine learning on graphs or networks requires embeddings that not only represent nodes and edges as low-dimensional vectors but also preserve the graph structure. Established methods for generating embeddings require flexible exploration of the entire graph through repeated use of random walks that capture graph structure with samples of nodes and edges. These methods create scalability challenges for massive graphs with millions to billions of edges because single-node solutions have inadequate memory and processing capabilities. We present NOMAD, a distributed-memory graph embedding framework using the Message Passing Interface (MPI) for distributed graphs. NOMAD implements proximity-based models proposed in the widely popular LINE (Large-scale Information Network Embedding) algorithm. We propose several practical trade-offs to improve scalability and reduce the communication overheads confronted by irregular and distributed graph embedding methods, catering to massive-scale graphs arising in web and science domains. NOMAD demonstrates median speedups of 10-100x on the CPU-based NERSC Perlmutter cluster relative to the popular reference implementations of multi-threaded LINE and node2vec, 35-76x over distributed PBG, and competitive embedding quality relative to LINE, node2vec, and GraphVite, while yielding 12-370x end-to-end speedups on real-world graphs.
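As a rough illustration of the kind of proximity-based model the abstract refers to, the sketch below trains LINE-style first-order proximity embeddings with negative sampling on a single machine. It is a simplified stand-in, not NOMAD's implementation. The function name `train_first_order` and all hyperparameter values are hypothetical, edges are sampled uniformly (assuming an unweighted graph), and negatives are drawn uniformly rather than from the degree-based noise distribution used in LINE.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def train_first_order(edges, num_nodes, dim=8, negatives=3, lr=0.05,
                      steps=5000, seed=0):
    """SGD on a first-order proximity objective with negative sampling."""
    rng = random.Random(seed)
    # Small random initialization, one vector per node.
    emb = [[rng.uniform(-0.5, 0.5) / dim for _ in range(dim)]
           for _ in range(num_nodes)]
    for _ in range(steps):
        u, v = rng.choice(edges)
        if rng.random() < 0.5:  # treat each undirected edge as two directed ones
            u, v = v, u
        # One observed neighbor (label 1) plus uniformly sampled noise nodes (label 0).
        targets = [(v, 1.0)]
        targets += [(rng.randrange(num_nodes), 0.0) for _ in range(negatives)]
        grad_u = [0.0] * dim
        for t, label in targets:
            if label == 0.0 and t in (u, v):
                continue  # skip noise samples that collide with the positive pair
            score = sum(a * b for a, b in zip(emb[u], emb[t]))
            g = lr * (label - sigmoid(score))
            for k in range(dim):
                grad_u[k] += g * emb[t][k]  # accumulate the update for u
                emb[t][k] += g * emb[u][k]  # update the target immediately
        for k in range(dim):
            emb[u][k] += grad_u[k]
    return emb
```

Each step touches only the embeddings of one sampled edge and a few noise nodes. In a distributed-memory setting those vectors may live on other ranks, so every such access becomes a potential message.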
Problem

Research questions and friction points this paper is trying to address.

graph embedding
scalability
distributed graphs
massive graphs
random walks
Innovation

Methods, ideas, or system contributions that make the work stand out.

distributed graph embedding
MPI
scalability
proximity-based models
large-scale graphs