🤖 AI Summary
Operator-level parallelism configuration search for large-model training suffers from an explosive search space, high communication overhead, and a trade-off between automation and efficiency. Method: This paper proposes a lightweight search framework based on runtime performance profiling. Its core innovation is the first formal definition of the “ParallelBlock” and its sequential structure: a communication-free computational unit through which a partitioning of the input tensor propagates without introducing communication. This reduces combinatorial search to enumerating input partitions, and cross-Block behavioral clustering compresses the search space further. The method integrates tensor-partition propagation analysis, communication-freedom verification, piecewise performance modeling, and combinatorial performance prediction. Contribution/Results: On GPT, LLaMA, and MoE models, it achieves up to 1.51×, 1.31×, and 3.43× training speedup over Alpa, respectively, while significantly reducing profiling overhead and search time.
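To see why enumerating input partitions per Block shrinks the search space, a back-of-envelope comparison helps. The numbers below are hypothetical placeholders, not figures from the paper; they only illustrate the combinatorics of per-operator enumeration versus per-Block input-partition enumeration versus cross-Block clustering.

```python
# Hypothetical sizes (for illustration only, not from the paper):
n_ops_per_block = 8   # operators in one ParallelBlock
n_partitions = 4      # candidate partitions per tensor
n_blocks = 48         # ParallelBlocks in the model
n_clusters = 3        # groups of Blocks with similar parallel behavior

# Naive per-operator enumeration: every operator picks its own partition.
naive = (n_partitions ** n_ops_per_block) * n_blocks

# Per-Block enumeration: only the input partition at each Block entry varies;
# partitions of interior operators are inferred by propagation.
per_block = n_partitions * n_blocks

# With behavioral clustering: profile one representative Block per cluster.
clustered = n_partitions * n_clusters

print(naive, per_block, clustered)  # 3145728 192 12
```

Even with these modest sizes, the profiled space drops by several orders of magnitude, which is what makes runtime profiling (rather than analytical cost modeling alone) affordable.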
📝 Abstract
This paper introduces CFP, a system that searches intra-operator parallelism configurations by leveraging runtime profiles of actual parallel programs. The key idea is to profile a limited space by identifying a new structure named ParallelBlock, a group of operators with the property of communication-free tensor partition propagation: the partition of its input tensor can propagate through all operators to its output tensor without introducing communication or synchronization. Based on this property, the optimal tensor partitions of operators within a ParallelBlock can be inferred from the partition of the input tensor through partition propagation, preventing avoidable communication. Thus, the search space is reduced to profiling each ParallelBlock with different input tensor partitions at its entry, instead of enumerating all partition combinations among the operators within the ParallelBlock. The search space is further reduced by identifying ParallelBlock sequences (segments) with similar parallel behavior. CFP computes the overall performance of the model from the profiles of all segments. On GPT, LLaMA, and MoE models, CFP achieves up to a 1.51×, 1.31×, and 3.43× speedup over the state-of-the-art framework, Alpa.
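The communication-free propagation property described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: partitions are modeled as a single sharded axis, and each operator carries a hypothetical rule mapping an input sharding to an output sharding, or to `None` when that sharding would force communication (e.g., an all-reduce).

```python
# Toy model of tensor-partition propagation (hypothetical rules, not CFP's code).
# A "partition" is the index of the sharded axis of a 2-D tensor.

def matmul_rule(axis):
    # Y = X @ W with X sharded on `axis`: row sharding (axis 0) propagates
    # to the output; column sharding (axis 1) would need an all-reduce.
    return 0 if axis == 0 else None

def elementwise_rule(axis):
    # Elementwise ops (e.g., GeLU) preserve any sharding.
    return axis

def propagate(block, input_axis):
    """Propagate an input partition through a chain of operators.

    Returns the output sharded axis if the partition reaches the output
    without communication, or None otherwise (i.e., this input partition
    does not make the chain a ParallelBlock)."""
    axis = input_axis
    for rule in block:
        axis = rule(axis)
        if axis is None:
            return None
    return axis

block = [matmul_rule, elementwise_rule, matmul_rule]
print(propagate(block, 0))  # 0: row partition propagates communication-free
print(propagate(block, 1))  # None: column partition would require communication
```

Under this abstraction, profiling a Block only requires timing it once per feasible input partition; the partitions of every interior operator follow deterministically from the propagation rules.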