🤖 AI Summary
Operator-level parallelism configuration search for large-model training suffers from an explosive search space, high communication overhead, and a trade-off between automation and efficiency. Method: This paper proposes a lightweight search framework based on runtime performance profiling. Its core innovation is the first formal definition of the “ParallelBlock” and its sequential structure: a communication-free computational unit through which a partitioning of the input tensor propagates without introducing communication. This reduces combinatorial search to enumerating input partitions, and cross-Block behavioral clustering compresses the search space further. The method integrates tensor-partition propagation analysis, communication-freedom verification, piecewise performance modeling, and combinatorial performance prediction. Contribution/Results: On GPT, LLaMA, and MoE models, it achieves up to 1.51×, 1.31×, and 3.43× training speedup over Alpa, respectively, while significantly reducing profiling overhead and search time.
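To see why enumerating input partitions per Block shrinks the search space, a back-of-envelope comparison helps. The numbers below are hypothetical placeholders, not figures from the paper; they only illustrate the combinatorics of per-operator enumeration versus per-Block input-partition enumeration versus cross-Block clustering.

```python
# Hypothetical sizes (for illustration only, not from the paper):
n_ops_per_block = 8   # operators in one ParallelBlock
n_partitions = 4      # candidate partitions per tensor
n_blocks = 48         # ParallelBlocks in the model
n_clusters = 3        # groups of Blocks with similar parallel behavior

# Naive per-operator enumeration: every operator picks its own partition.
naive = (n_partitions ** n_ops_per_block) * n_blocks

# Per-Block enumeration: only the input partition at each Block entry varies;
# partitions of interior operators are inferred by propagation.
per_block = n_partitions * n_blocks

# With behavioral clustering: profile one representative Block per cluster.
clustered = n_partitions * n_clusters

print(naive, per_block, clustered)  # 3145728 192 12
```

Even with these modest sizes, the profiled space drops by several orders of magnitude, which is what makes runtime profiling (rather than analytical cost modeling alone) affordable.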
📝 Abstract
This paper introduces CFP, a system that searches intra-operator parallelism configurations by leveraging runtime profiles of actual parallel programs. The key idea is to profile a limited space by identifying a new structure named ParallelBlock, a group of operators with the property of communication-free tensor partition propagation: the partition of its input tensor can propagate through all operators to its output tensor without introducing communication or synchronization. Based on this property, the optimal tensor partitions of operators within a ParallelBlock can be inferred from the partition of the input tensor through partition propagation, preventing avoidable communication. Thus, the search space is reduced to profiling each ParallelBlock with different input tensor partitions at its entry, instead of enumerating all partition combinations among the operators within the ParallelBlock. The search space is further reduced by identifying ParallelBlock sequences (segments) with similar parallel behavior. CFP computes the overall performance of the model from the profiles of all segments. On GPT, LLaMA, and MoE models, CFP achieves up to a 1.51×, 1.31×, and 3.43× speedup over the state-of-the-art framework, Alpa.
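The communication-free propagation property described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: partitions are modeled as a single sharded axis, and each operator carries a hypothetical rule mapping an input sharding to an output sharding, or to `None` when that sharding would force communication (e.g., an all-reduce).

```python
# Toy model of tensor-partition propagation (hypothetical rules, not CFP's code).
# A "partition" is the index of the sharded axis of a 2-D tensor.

def matmul_rule(axis):
    # Y = X @ W with X sharded on `axis`: row sharding (axis 0) propagates
    # to the output; column sharding (axis 1) would need an all-reduce.
    return 0 if axis == 0 else None

def elementwise_rule(axis):
    # Elementwise ops (e.g., GeLU) preserve any sharding.
    return axis

def propagate(block, input_axis):
    """Propagate an input partition through a chain of operators.

    Returns the output sharded axis if the partition reaches the output
    without communication, or None otherwise (i.e., this input partition
    does not make the chain a ParallelBlock)."""
    axis = input_axis
    for rule in block:
        axis = rule(axis)
        if axis is None:
            return None
    return axis

block = [matmul_rule, elementwise_rule, matmul_rule]
print(propagate(block, 0))  # 0: row partition propagates communication-free
print(propagate(block, 1))  # None: column partition would require communication
```

Under this abstraction, profiling a Block only requires timing it once per feasible input partition; the partitions of every interior operator follow deterministically from the propagation rules.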