D-PDLP: Scaling PDLP to Distributed Multi-GPU Systems

📅 2026-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a distributed primal-dual hybrid gradient (PDHG) framework tailored to multi-GPU systems, overcoming the memory and computational limits of single-GPU implementations for industrial-scale linear programs (LPs). By partitioning the constraint matrix over a two-dimensional grid and combining sparsity-aware data distribution with block-level randomized reshuffling, the approach scales computation and memory usage together. The design preserves FP64 numerical precision while significantly improving load balancing and scalability. Leveraging efficient NCCL-based communication and fused CUDA kernels, the implementation surpasses single-GPU memory constraints and demonstrates strong scalability and solution quality on standard LP benchmarks, including MIPLIB and the Mittelmann set, as well as on large-scale real-world datasets.
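For context on what each GPU iterates over: the base PDHG update for an LP in standard form (min cᵀx s.t. Ax = b, x ≥ 0) alternates a projected primal step with a dual step on an extrapolated point. Below is a minimal single-process sketch of that vanilla iteration; it is not the paper's distributed implementation, and the function name and step-size choices are illustrative.

```python
import numpy as np

def pdhg_lp(c, A, b, tau, sigma, iters):
    """Vanilla PDHG (Chambolle-Pock) for: min c^T x  s.t. Ax = b, x >= 0.
    Primal step: projected gradient on the Lagrangian (projection = clamp at 0).
    Dual step: ascent using the extrapolated point 2*x_new - x.
    Step sizes should satisfy tau * sigma * ||A||^2 <= 1 for convergence."""
    x = np.zeros(A.shape[1])
    y = np.zeros(A.shape[0])
    for _ in range(iters):
        x_new = np.maximum(0.0, x - tau * (c - A.T @ y))  # primal, projected onto x >= 0
        y = y + sigma * (b - A @ (2.0 * x_new - x))       # dual, on extrapolated primal
        x = x_new
    return x, y

# Tiny illustration: min x1 + 2*x2  s.t.  x1 + x2 = 1, x >= 0  (optimum x = (1, 0))
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
x, y = pdhg_lp(c, A, b, tau=0.5, sigma=0.5, iters=2000)
```

In the distributed setting described above, the matrix-vector products `A @ x` and `A.T @ y` are the dominant cost per iteration, which is why the paper's 2D partitioning of A is the central design choice.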

📝 Abstract
We present a distributed framework of the Primal-Dual Hybrid Gradient (PDHG) algorithm for solving massive-scale linear programming (LP) problems. Although PDHG-based solvers demonstrate strong performance on single-node GPU architectures, their applicability to industrial-scale instances is often limited by single-GPU computational throughput. To overcome these challenges, we propose D-PDLP, the first Distributed PDLP framework, which extends PDHG to a multi-GPU setting via a practical two-dimensional grid partitioning of the constraint matrix. To improve load balance and computational efficiency, we introduce a block-wise random permutation strategy combined with nonzero-aware matrix partitioning. By distributing the intensive computation required in PDHG iterations, the proposed framework harnesses multi-GPU parallelism to achieve substantial speedups with relatively low communication overhead. Extensive experiments on standard LP benchmarks (including MIPLIB and Mittelmann instances) as well as huge-scale real-world datasets show that our distributed implementation, built upon cuPDLPx, achieves strong scalability and high performance while preserving full FP64 numerical accuracy.
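The abstract's "two-dimensional grid partitioning" with "nonzero-aware matrix partitioning" can be illustrated with a small sketch: choose row and column split points from cumulative nonzero counts so each grid block carries roughly the same number of nonzeros, rather than the same number of rows/columns. This is a minimal single-node approximation under stated assumptions, not the paper's actual partitioner; function names are hypothetical.

```python
import numpy as np
import scipy.sparse as sp

def balanced_splits(counts, parts):
    """Pick split points so each of `parts` segments gets ~equal total count."""
    cum = np.cumsum(counts)
    targets = cum[-1] * np.arange(1, parts) / parts
    return np.searchsorted(cum, targets) + 1

def partition_2d(A, pr, pc):
    """Partition sparse A into a pr x pc grid of blocks, choosing split
    points from per-row / per-column nonzero counts (nonzero-aware)."""
    A = sp.csr_matrix(A)
    row_nnz = np.diff(A.indptr)                              # nnz per row (CSR)
    col_nnz = np.bincount(A.indices, minlength=A.shape[1])   # nnz per column
    rsp = np.concatenate(([0], balanced_splits(row_nnz, pr), [A.shape[0]]))
    csp = np.concatenate(([0], balanced_splits(col_nnz, pc), [A.shape[1]]))
    return [[A[rsp[i]:rsp[i + 1], csp[j]:csp[j + 1]] for j in range(pc)]
            for i in range(pr)]

# Example: a random sparse "constraint matrix" split over a 2 x 3 GPU grid.
A = sp.random(40, 60, density=0.1, random_state=0, format="csr")
blocks = partition_2d(A, 2, 3)
```

Balancing on nonzeros rather than dimensions is what keeps the per-GPU SpMV work in the PDHG iteration comparable across ranks, which matches the load-balancing motivation stated in the abstract.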
Problem

Research questions and friction points this paper is trying to address.

Linear Programming
PDHG
Distributed Multi-GPU
Scalability
Memory Bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

distributed PDHG
multi-GPU linear programming
2D matrix partitioning
nonzero-aware data distribution
fused CUDA kernels
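The "block-wise random permutation" idea listed above can be sketched as a reshuffling schedule: each epoch, blocks are randomly permuted and dealt round-robin to GPUs, so no single GPU is stuck with a disproportionately heavy block across the whole run. This is an illustrative sketch only; the assignment scheme and names are assumptions, not the paper's implementation.

```python
import numpy as np

def reshuffled_assignments(n_blocks, n_gpus, n_epochs, seed=0):
    """Per epoch: randomly permute block indices, then deal them
    round-robin to GPUs. Returns one {gpu: [block, ...]} dict per epoch."""
    rng = np.random.default_rng(seed)
    schedule = []
    for _ in range(n_epochs):
        perm = rng.permutation(n_blocks)
        schedule.append({g: perm[g::n_gpus].tolist() for g in range(n_gpus)})
    return schedule

# Example: 8 matrix blocks reshuffled over 3 GPUs for 4 epochs.
sched = reshuffled_assignments(n_blocks=8, n_gpus=3, n_epochs=4)
```

Every epoch still covers each block exactly once; only the block-to-GPU mapping changes, which is what evens out load when block nonzero counts are skewed.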