DiT-HC: Enabling Efficient Training of Visual Generation Model DiT on HPC-oriented CPU Cluster

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of efficiently training large-scale vision generative models—specifically Diffusion Transformers (DiT)—on CPU-based high-performance computing (HPC) clusters, where high communication overhead, complex memory management, and operator performance bottlenecks hinder scalability. The authors present the first successful large-scale DiT training on HPC CPU systems, introducing Communication-Free Tensor Parallelism (CFTP), AutoMem for automated memory scheduling, optimized HCOps kernels, and a customized MPI backend, while fully leveraging high-bandwidth memory and matrix acceleration units. Their approach achieves 8.2–87.7× speedup over existing CPU-based solutions and demonstrates 90.6% weak scaling efficiency on 256 nodes, significantly advancing the integration of AI and scientific computing.
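As a rough, hypothetical sketch of the general idea behind communication-free tensor parallelism (this is not DiT-HC's actual CFTP implementation, and names such as shard_columns are invented for illustration), the following NumPy snippet column-shards a linear layer's weight so that each simulated rank computes its own output shard from replicated activations without exchanging any data inside the layer.

# Hypothetical sketch: column-sharded linear layer where each "rank" computes
# its own output shard from a replicated input, so the layer itself needs no
# collective communication. Illustration of the general idea only, not DiT-HC's CFTP.
import numpy as np

def shard_columns(W, num_ranks):
    # Split the weight matrix along its output (column) dimension.
    return np.array_split(W, num_ranks, axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64))      # replicated activations (batch, d_in)
W = rng.standard_normal((64, 128))    # full weight (d_in, d_out)

num_ranks = 4
shards = shard_columns(W, num_ranks)

# Each simulated rank multiplies the replicated input by its own weight shard;
# no data is exchanged between ranks inside this layer.
local_outputs = [x @ W_shard for W_shard in shards]

# Concatenating the shards reproduces the full output exactly.
y_full = x @ W
assert np.allclose(np.concatenate(local_outputs, axis=1), y_full)

Whether the next layer can consume these shards without a collective is exactly the design question a scheme like CFTP has to answer; the snippet only shows the layer-local independence.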

📝 Abstract
Generative foundation models have become an important tool for data reconstruction and simulation in scientific computing, showing a tight integration with traditional numerical simulations. At the same time, with the development of new hardware features, such as matrix acceleration units and high-bandwidth memory, CPU-based clusters offer promising opportunities to accelerate and scale such models, facilitating the unification of artificial intelligence and scientific computing. We present DiT-HC, the first system to train and scale the generative model DiT on a next-generation HPC CPU cluster. DiT-HC introduces three key techniques: (1) communication-free tensor parallelism (CFTP) with AutoMem for automated memory-aware dataflow, (2) HCOps, a suite of optimized GEMM and operator kernels leveraging vector and matrix acceleration units, and (3) a custom MPI backend that overlaps computation, communication, and memory movement. Experiments show 8.2 to 87.7 times speedups over native or public CPU libraries and 90.6% weak scaling efficiency on 256 nodes. These results demonstrate the feasibility of large-scale generative model training on CPU clusters and provide new insights for future HPC-AI co-design.
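The abstract also credits a custom MPI backend with overlapping computation, communication, and memory movement. The sketch below (Python with mpi4py; the ring-exchange pattern and buffer sizes are assumptions, not the authors' backend) shows the generic overlap pattern: post non-blocking sends and receives, run local GEMM work while the transfer is in flight, and synchronize only afterwards.

# Hypothetical sketch of compute/communication overlap with non-blocking MPI
# (mpi4py). Not DiT-HC's custom backend; just the generic overlap pattern.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

send_buf = np.full(1 << 20, rank, dtype=np.float32)   # e.g. a gradient shard
recv_buf = np.empty_like(send_buf)

# Post a ring exchange without blocking.
reqs = [
    comm.Isend(send_buf, dest=(rank + 1) % size, tag=0),
    comm.Irecv(recv_buf, source=(rank - 1) % size, tag=0),
]

# Local compute proceeds while the transfer is in flight.
a = np.random.default_rng(rank).standard_normal((512, 512), dtype=np.float32)
local_result = a @ a.T

# Only synchronize once the overlapping work is done.
MPI.Request.Waitall(reqs)

A run such as "mpirun -np 4 python overlap_sketch.py" exercises the pattern; a real backend would additionally pipeline host-memory movement, which is omitted here.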
Problem

Research questions and friction points this paper is trying to address.

DiT
HPC
CPU cluster
generative model training
efficient training
Innovation

Methods, ideas, or system contributions that make the work stand out.

DiT-HC
communication-free tensor parallelism
HCOps
CPU-based HPC
generative model training
Jinxiao Zhang
Department of Earth System Science, Tsinghua University, Beijing, China
Yunpu Xu
Institute of Data and Information, Tsinghua Shenzhen International Graduate School, Shenzhen, China
Xiyong Wu
Institute of Data and Information, Tsinghua Shenzhen International Graduate School, Shenzhen, China
Runmin Dong
School of Artificial Intelligence, Sun Yat-sen University, Zhuhai, China
Shenggan Cheng
National University of Singapore
Machine Learning Systems · High Performance Computing · Deep Learning
Yi Zhao
Tsinghua University
HPC
Mengxuan Chen
Tsinghua University
AI4Science · machine learning · earth system model
Qinrui Zheng
National Supercomputing Center in Shenzhen, Shenzhen, China
Jianting Liu
National Supercomputing Center in Shenzhen, Shenzhen, China
Haohuan Fu
Tsinghua University