DiT-HC: Enabling Efficient Training of Visual Generation Model DiT on HPC-oriented CPU Cluster

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of efficiently training large-scale vision generative models—specifically Diffusion Transformers (DiT)—on CPU-based high-performance computing (HPC) clusters, where high communication overhead, complex memory management, and operator performance bottlenecks hinder scalability. The authors present the first successful large-scale DiT training on HPC CPU systems, introducing Communication-Free Tensor Parallelism (CFTP), AutoMem for automated memory scheduling, optimized HCOps kernels, and a customized MPI backend, while fully leveraging high-bandwidth memory and matrix acceleration units. Their approach achieves 8.2–87.7× speedup over existing CPU-based solutions and demonstrates 90.6% weak scaling efficiency on 256 nodes, significantly advancing the integration of AI and scientific computing.
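As a rough, hypothetical sketch of the general idea behind communication-free tensor parallelism (this is not DiT-HC's actual CFTP implementation, and names such as shard_columns are invented for illustration), the following NumPy snippet column-shards a linear layer's weight so that each simulated rank computes its own output shard from replicated activations without exchanging any data inside the layer.

# Hypothetical sketch: column-sharded linear layer where each "rank" computes
# its own output shard from a replicated input, so the layer itself needs no
# collective communication. Illustration of the general idea only, not DiT-HC's CFTP.
import numpy as np

def shard_columns(W, num_ranks):
    # Split the weight matrix along its output (column) dimension.
    return np.array_split(W, num_ranks, axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64))      # replicated activations (batch, d_in)
W = rng.standard_normal((64, 128))    # full weight (d_in, d_out)

num_ranks = 4
shards = shard_columns(W, num_ranks)

# Each simulated rank multiplies the replicated input by its own weight shard;
# no data is exchanged between ranks inside this layer.
local_outputs = [x @ W_shard for W_shard in shards]

# Concatenating the shards reproduces the full output exactly.
y_full = x @ W
assert np.allclose(np.concatenate(local_outputs, axis=1), y_full)

Whether the next layer can consume these shards without a collective is exactly the design question a scheme like CFTP has to answer; the snippet only shows the layer-local independence.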

📝 Abstract
Generative foundation models have become an important tool for data reconstruction and simulation in scientific computing, showing a tight integration with traditional numerical simulations. At the same time, with the development of new hardware features, such as matrix acceleration units and high-bandwidth memory, CPU-based clusters offer promising opportunities to accelerate and scale such models, facilitating the unification of artificial intelligence and scientific computing. We present DiT-HC, the first system to train and scale the generative model DiT on a next-generation HPC CPU cluster. DiT-HC introduces three key techniques: (1) communication-free tensor parallelism (CFTP) with AutoMem for automated memory-aware dataflow, (2) HCOps, a suite of optimized GEMM and operator kernels leveraging vector and matrix acceleration units, and (3) a custom MPI backend that overlaps computation, communication, and memory movement. Experiments show 8.2 to 87.7 times speedups over native or public CPU libraries and 90.6% weak scaling efficiency on 256 nodes. These results demonstrate the feasibility of large-scale generative model training on CPU clusters and provide new insights for future HPC-AI co-design.
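The abstract also credits a custom MPI backend with overlapping computation, communication, and memory movement. The sketch below (Python with mpi4py; the ring-exchange pattern and buffer sizes are assumptions, not the authors' backend) shows the generic overlap pattern: post non-blocking sends and receives, run local GEMM work while the transfer is in flight, and synchronize only afterwards.

# Hypothetical sketch of compute/communication overlap with non-blocking MPI
# (mpi4py). Not DiT-HC's custom backend; just the generic overlap pattern.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

send_buf = np.full(1 << 20, rank, dtype=np.float32)   # e.g. a gradient shard
recv_buf = np.empty_like(send_buf)

# Post a ring exchange without blocking.
reqs = [
    comm.Isend(send_buf, dest=(rank + 1) % size, tag=0),
    comm.Irecv(recv_buf, source=(rank - 1) % size, tag=0),
]

# Local compute proceeds while the transfer is in flight.
a = np.random.default_rng(rank).standard_normal((512, 512), dtype=np.float32)
local_result = a @ a.T

# Only synchronize once the overlapping work is done.
MPI.Request.Waitall(reqs)

A run such as "mpirun -np 4 python overlap_sketch.py" exercises the pattern; a real backend would additionally pipeline host-memory movement, which is omitted here.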
Problem

Research questions and friction points this paper is trying to address.

DiT
HPC
CPU cluster
generative model training
efficient training
Innovation

Methods, ideas, or system contributions that make the work stand out.

DiT-HC
communication-free tensor parallelism
HCOps
CPU-based HPC
generative model training
Jinxiao Zhang
Department of Earth System Science, Tsinghua University, Beijing, China
Yunpu Xu
Institute of Data and Information, Tsinghua Shenzhen International Graduate School, Shenzhen, China
Xiyong Wu
Institute of Data and Information, Tsinghua Shenzhen International Graduate School, Shenzhen, China
Runmin Dong
School of Artificial Intelligence, Sun Yat-sen University, Zhuhai, China
Shenggan Cheng
National University of Singapore
Machine Learning Systems · High Performance Computing · Deep Learning
Yi Zhao
Tsinghua University
HPC
Mengxuan Chen
Tsinghua University
AI4Science · machine learning · earth system model
Qinrui Zheng
National Supercomputing Center in Shenzhen, Shenzhen, China
Jianting Liu
National Supercomputing Center in Shenzhen, Shenzhen, China
Haohuan Fu
Tsinghua University