🤖 AI Summary
Existing distributed matrix multiplication algorithms support only limited partitioning schemes; mismatched configurations require redundant data redistribution, incurring substantial communication overhead. Method: We propose a universal one-sided algorithm that unifies arbitrary tiling strategies and replication factors via slice-index arithmetic, eliminating the need for specialized per-partitioning implementations. Built on a high-level C++ PGAS framework, it performs direct GPU-to-GPU communication over high-speed intra-node interconnects, dynamically generating and reordering local compute tasks to maximize computation-communication overlap. Contribution/Results: This is the first implementation to natively support all data distribution patterns within a single codebase, significantly improving flexibility and maintainability. Experimental evaluation demonstrates performance competitive with PyTorch DTensor across diverse tiling and replication configurations, validating its efficiency and generality for AI training and scientific computing workloads.
📝 Abstract
Many important applications across science, data analytics, and AI workloads depend on distributed matrix multiplication. Prior work has developed a large array of algorithms suited to different problem sizes and partitionings, including 1D, 2D, 1.5D, and 2.5D variants. A limitation of this prior work is that each algorithm supports only a subset of partitionings, so multiple algorithm implementations are required to cover the full space of possible partitionings. If no implementation is available for a particular set of partitionings, one or more operands must be redistributed, increasing communication costs. This paper presents a universal one-sided algorithm for distributed matrix multiplication that supports all combinations of partitionings and replication factors. Our algorithm uses slicing (index arithmetic) to compute the sets of overlapping tiles that must be multiplied together. This list of local matrix multiplies can then either be executed directly, or reordered and lowered to an optimized IR to maximize overlap. We implement our algorithm using a high-level C++-based PGAS programming framework that performs direct GPU-to-GPU communication using intra-node interconnects. We evaluate performance for a wide variety of partitionings and replication factors, finding that our algorithm is competitive with PyTorch DTensor, a highly optimized distributed tensor library targeting AI models.
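To make the slicing idea concrete, here is a minimal Python sketch of how index arithmetic can enumerate the local multiply tasks when the two operands tile the shared (contraction) dimension differently. The function names, cut-point representation, and task format are illustrative assumptions, not the paper's actual API; the real system operates on C++ PGAS data structures and also handles the row/column dimensions and replication.

```python
def intervals(cuts):
    """Turn cut points [0, c1, ..., n] into half-open (start, end) tiles."""
    return [(cuts[i], cuts[i + 1]) for i in range(len(cuts) - 1)]

def overlap(a, b):
    """Intersection of two half-open intervals, or None if disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None

def local_matmul_tasks(a_k_cuts, b_k_cuts):
    """For C = A @ B, list which tile of A must be multiplied with which
    tile of B, based purely on index overlap along the K dimension.
    Each task is (A tile index, B tile index, overlapping K range)."""
    tasks = []
    for ia, ta in enumerate(intervals(a_k_cuts)):
        for ib, tb in enumerate(intervals(b_k_cuts)):
            ov = overlap(ta, tb)
            if ov is not None:
                tasks.append((ia, ib, ov))
    return tasks

# Example: A splits K = 8 at [0, 4, 8]; B splits it at [0, 3, 6, 8].
print(local_matmul_tasks([0, 4, 8], [0, 3, 6, 8]))
# → [(0, 0, (0, 3)), (0, 1, (3, 4)), (1, 1, (4, 6)), (1, 2, (6, 8))]
```

Because the task list is computed purely from indices, mismatched tilings produce a valid schedule without redistributing either operand; the resulting list can be executed directly or reordered to overlap communication with computation, as the abstract describes.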