🤖 AI Summary
This work addresses the challenge of efficiently training large-scale vision generative models, specifically Diffusion Transformers (DiT), on CPU-based high-performance computing (HPC) clusters, where high communication overhead, complex memory management, and operator performance bottlenecks hinder scalability. The authors present the first large-scale DiT training on an HPC CPU system, introducing Communication-Free Tensor Parallelism (CFTP), AutoMem for automated memory scheduling, optimized HCOps kernels, and a customized MPI backend, all while fully exploiting high-bandwidth memory and matrix acceleration units. Their approach achieves an 8.2–87.7× speedup over existing CPU-based solutions and 90.6% weak scaling efficiency on 256 nodes, significantly advancing the integration of AI and scientific computing.
📝 Abstract
Generative foundation models have become an important tool for data reconstruction and simulation in scientific computing, and they are increasingly integrated with traditional numerical simulation. At the same time, new hardware features such as matrix acceleration units and high-bandwidth memory make CPU-based clusters a promising platform for accelerating and scaling such models, furthering the unification of artificial intelligence and scientific computing. We present DiT-HC, the first system to train and scale the generative model DiT on a next-generation HPC CPU cluster. DiT-HC introduces three key techniques: (1) communication-free tensor parallelism (CFTP) with AutoMem for automated memory-aware dataflow, (2) HCOps, a suite of optimized GEMM and operator kernels that exploit vector and matrix acceleration units, and (3) a custom MPI backend that overlaps computation, communication, and memory movement. Experiments show speedups of 8.2–87.7× over native or publicly available CPU libraries and 90.6% weak scaling efficiency on 256 nodes. These results demonstrate the feasibility of large-scale generative-model training on CPU clusters and offer new insights for future HPC-AI co-design.
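The abstract does not detail how CFTP avoids communication, but one common way a tensor-parallel layer can be made communication-free is a column-parallel split: each rank holds a column shard of a layer's weight matrix plus the full (replicated) input activations, so every output shard is complete locally and no all-reduce is needed. The sketch below illustrates only this general idea with NumPy; the paper's actual CFTP scheme may differ, and the function name `cftp_forward` is an invented placeholder.

```python
# Hypothetical sketch of communication-free tensor parallelism (CFTP):
# each "rank" holds one column shard of the weight matrix and the full
# input activations, so its output shard is complete with no all-reduce.
# This is an assumed interpretation, not the paper's actual algorithm.
import numpy as np

def cftp_forward(x, weight_shards):
    """Each rank computes x @ W_i locally; concatenating the shards
    reproduces the full GEMM output without any inter-rank traffic."""
    return [x @ w for w in weight_shards]

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64))       # activations, replicated on all ranks
W = rng.standard_normal((64, 128))     # full weight matrix
shards = np.split(W, 4, axis=1)        # column-parallel split across 4 "ranks"

local_outputs = cftp_forward(x, shards)
full_output = np.concatenate(local_outputs, axis=1)
assert np.allclose(full_output, x @ W)  # matches the unsharded GEMM exactly
```

The trade-off behind such a scheme is replicated activations (more memory per rank) in exchange for zero synchronization, which fits the paper's emphasis on high-bandwidth memory.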
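The custom MPI backend is said to overlap computation, communication, and memory movement. A real implementation would rely on nonblocking primitives such as `MPI_Isend`/`MPI_Irecv`; the toy sketch below uses a Python background thread as a stand-in to show the double-buffered pattern: while step i's result is "in flight", step i+1's compute already runs. All names here (`compute`, `communicate`) are illustrative placeholders, not the paper's API.

```python
# Toy sketch of compute/communication overlap (double buffering).
# A background thread stands in for a nonblocking MPI send; the main
# loop starts the next "GEMM" while the previous result is in flight.
import threading
import time

def compute(chunk):
    time.sleep(0.01)          # stand-in for a GEMM on one micro-batch
    return chunk * 2

def communicate(result, outbox):
    time.sleep(0.01)          # stand-in for a nonblocking gradient send
    outbox.append(result)

outbox, pending = [], None
for chunk in range(4):
    result = compute(chunk)   # compute step i
    if pending is not None:
        pending.join()        # drain step i-1's in-flight "send"
    pending = threading.Thread(target=communicate, args=(result, outbox))
    pending.start()           # overlap this send with the next compute
pending.join()

assert outbox == [0, 2, 4, 6]
```

With compute and transfer each taking one time unit, this pattern roughly halves the serialized cost, which is the effect the abstract's overlap claim targets.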