🤖 AI Summary
Diffusion Transformers (DiTs) suffer from severe inter-GPU workload imbalance, and the resulting straggler effects, when trained jointly on mixed-resolution images and videos with variable-length text inputs, because visual and textual token counts fluctuate from sample to sample.
Method: We propose the first online scheduling framework integrating sequence parallelism with dynamic load balancing. It jointly optimizes sequence partitioning and token redistribution by formulating the problem as a distributed knapsack optimization that minimizes cross-GPU workload variance. It builds on DeepSpeed-Ulysses for low-overhead token reallocation and pairs a lightweight semi-empirical workload model with distributed metadata collection.
Contribution/Results: Evaluated on state-of-the-art DiT models (e.g., FLUX), our framework achieves a 2–3× training speedup with a per-GPU workload discrepancy below 1%, significantly improving training efficiency on large-scale heterogeneous multimodal datasets.
📝 Abstract
We present KnapFormer, an efficient and versatile framework that combines workload balancing and sequence parallelism in distributed training of Diffusion Transformers (DiT). KnapFormer builds on the insight that strong synergy exists between sequence parallelism and the need to address the significant token imbalance across ranks. This imbalance arises from variable-length text inputs and varying visual token counts in mixed-resolution and image-video joint training. KnapFormer redistributes tokens by first gathering sequence length metadata across all ranks in a balancing group and then solving a global knapsack problem. The solver aims to minimize the variance of the total per-GPU workload while accounting for the effect of sequence parallelism. By integrating DeepSpeed-Ulysses-based sequence parallelism into the load-balancing decision process and using a simple semi-empirical workload model, KnapFormer achieves minimal communication overhead and less than 1% workload discrepancy in real-world training workloads with sequence lengths ranging from a few hundred to tens of thousands. It eliminates straggler effects and achieves a 2x to 3x speedup when training state-of-the-art diffusion models like FLUX on mixed-resolution and image-video joint data corpora. We open-source the KnapFormer implementation at https://github.com/Kai-46/KnapFormer/
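The balancing step described in the abstract can be illustrated with a minimal sketch. It is not the KnapFormer solver: it uses a plain greedy longest-processing-time heuristic, ignores sequence parallelism, and assumes a hypothetical cost model with one linear and one quadratic term in sequence length. The function names and the coefficients `linear_coef` and `quad_coef` are illustrative assumptions, not values from the paper or the repository.

```python
import heapq


def workload(seq_len, linear_coef=1.0, quad_coef=1e-3):
    """Hypothetical cost model: a linear term for per-token layers plus a
    quadratic term for self-attention. Coefficients are made up."""
    return linear_coef * seq_len + quad_coef * seq_len ** 2


def balance(seq_lens, num_gpus):
    """Greedy assignment: visit sequences from heaviest to lightest and give
    each one to the GPU with the smallest accumulated workload so far."""
    heap = [(0.0, gpu) for gpu in range(num_gpus)]  # (current load, gpu id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_gpus)]
    order = sorted(range(len(seq_lens)),
                   key=lambda i: workload(seq_lens[i]), reverse=True)
    for idx in order:
        load, gpu = heapq.heappop(heap)
        assignment[gpu].append(idx)
        heapq.heappush(heap, (load + workload(seq_lens[idx]), gpu))
    return assignment


def loads(seq_lens, assignment):
    """Total modeled workload per GPU for a given assignment."""
    return [sum(workload(seq_lens[i]) for i in idxs) for idxs in assignment]


if __name__ == "__main__":
    # Sequence lengths as they might look after gathering metadata from all ranks.
    seq_lens = [300, 500, 1024, 4096, 8192, 256, 2048, 700]
    plan = balance(seq_lens, num_gpus=4)
    print(plan)
    print(loads(seq_lens, plan))
```

In this toy batch the single 8192-token sequence still dominates one GPU no matter how the remaining sequences are placed. Whole-sequence assignment alone cannot fix that, which is consistent with the abstract's point that balancing benefits from being combined with sequence parallelism, so that a long sequence can be split across several GPUs.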