🤖 AI Summary
This work proposes a distributed primal-dual hybrid gradient (PDHG) framework for multi-GPU systems that overcomes the memory and computational limits of single-GPU implementations when solving industrial-scale linear programs (LPs). By partitioning the constraint matrix over a two-dimensional grid and combining sparsity-aware data distribution with block-level randomized reshuffling, the approach scales computation and memory usage together while maintaining FP64 numerical precision and good load balance. Leveraging efficient NCCL-based communication and fused CUDA kernels, the implementation moves beyond single-GPU memory constraints and demonstrates strong scalability and solution quality on standard LP benchmarks, including MIPLIB and the Mittelmann set, as well as large-scale real-world datasets.
📝 Abstract
We present a distributed framework of the Primal-Dual Hybrid Gradient (PDHG) algorithm for solving massive-scale linear programming (LP) problems. Although PDHG-based solvers demonstrate strong performance on single-node GPU architectures, their applicability to industrial-scale instances is often limited by the memory capacity and computational throughput of a single GPU. To overcome this limitation, we propose D-PDLP, the first Distributed PDLP framework, which extends PDHG to a multi-GPU setting via a practical two-dimensional grid partitioning of the constraint matrix. To improve load balance and computational efficiency, we introduce a block-wise random permutation strategy combined with nonzero-aware matrix partitioning. By distributing the intensive computation of each PDHG iteration, the proposed framework harnesses multi-GPU parallelism to achieve substantial speedups with relatively low communication overhead. Extensive experiments on standard LP benchmarks (including MIPLIB and Mittelmann instances) as well as massive-scale real-world datasets show that our distributed implementation, built upon cuPDLPx, achieves strong scalability and high performance while preserving full FP64 numerical accuracy.
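The two ideas at the core of the partitioning scheme can be sketched in a few lines. The snippet below is an illustrative toy, not the authors' implementation: it places the nonzeros of a constraint matrix (in COO form) onto a 2x2 process grid after a block-wise random permutation of row and column blocks, then counts nonzeros per grid cell to show how reshuffling evens out the load. All names, block sizes, and the stripe-based ownership rule are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n, nnz = 12, 16, 60            # toy problem sizes (assumed, divisible by the grid)
rows = rng.integers(0, m, nnz)    # COO coordinates of A's nonzeros
cols = rng.integers(0, n, nnz)

pr, pc = 2, 2                     # 2x2 process (GPU) grid
rb, cb = 3, 4                     # row/column block sizes for reshuffling

def block_perm(size, block, rng):
    """Permute whole blocks of indices, keeping entries within a block contiguous."""
    nblk = size // block
    order = rng.permutation(nblk)
    return np.concatenate([np.arange(b * block, (b + 1) * block) for b in order])

# Build old-index -> new-index maps from the block permutations.
row_map = np.empty(m, dtype=int)
row_map[block_perm(m, rb, rng)] = np.arange(m)
col_map = np.empty(n, dtype=int)
col_map[block_perm(n, cb, rng)] = np.arange(n)
rows_p, cols_p = row_map[rows], col_map[cols]

# Assign each nonzero to a grid cell by which row/column stripe it lands in.
owner = (rows_p * pr // m) * pc + (cols_p * pc // n)
counts = np.bincount(owner, minlength=pr * pc)
print("nonzeros per GPU:", counts)
```

In a real distributed solver the per-cell nonzero counts would drive a nonzero-aware choice of stripe boundaries rather than the equal stripes used here, and each GPU would keep only its local submatrix for the sparse matrix-vector products inside PDHG iterations, exchanging boundary vectors via collectives (NCCL in the paper's setting).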