KnapFormer: An Online Load Balancer for Efficient Diffusion Transformers Training

📅 2025-08-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion Transformers (DiTs) suffer from severe inter-GPU workload imbalance, and hence straggler effects, during mixed-resolution image-video joint training with variable-length text inputs, because visual and textual token counts fluctuate dynamically. Method: We propose the first online scheduling framework to integrate sequence parallelism with dynamic load balancing. It jointly optimizes sequence partitioning and token redistribution by formulating the problem as a distributed knapsack optimization that minimizes cross-GPU workload variance, combining a lightweight semi-empirical workload model with distributed metadata collection and leveraging DeepSpeed-Ulysses for low-overhead token reallocation. Contribution/Results: Evaluated on state-of-the-art DiT models (e.g., FLUX), the framework achieves a 2–3× training speedup with GPU workload variance below 1%, significantly improving training efficiency on large-scale heterogeneous multimodal datasets.
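
The semi-empirical workload model is not spelled out in this summary. As a rough illustration only, per-sequence transformer cost is often modeled as a quadratic attention term plus a linear MLP term in token count; the sketch below uses hypothetical coefficients `alpha` and `beta` (which would be fit on the target hardware), not the paper's actual model.

```python
# Minimal sketch of a semi-empirical workload model (hypothetical
# coefficients; the paper's actual model is in the open-source repo).
# Attention scales roughly quadratically in sequence length and the
# MLP/projection layers roughly linearly, so a two-term fit is natural:
#   cost(n) = alpha * n**2 + beta * n

def estimated_cost(num_tokens: int, alpha: float = 1e-6, beta: float = 1e-3) -> float:
    """Relative compute cost of one sequence of `num_tokens` tokens."""
    return alpha * num_tokens**2 + beta * num_tokens

if __name__ == "__main__":
    # Sequence lengths from a few hundred to tens of thousands, as in the paper.
    for n in (256, 4096, 32768):
        print(f"{n:>6} tokens -> cost {estimated_cost(n):.2f}")
```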

📝 Abstract
We present KnapFormer, an efficient and versatile framework to combine workload balancing and sequence parallelism in distributed training of Diffusion Transformers (DiT). KnapFormer builds on the insight that strong synergy exists between sequence parallelism and the need to address the significant token imbalance across ranks. This imbalance arises from variable-length text inputs and varying visual token counts in mixed-resolution and image-video joint training. KnapFormer redistributes tokens by first gathering sequence-length metadata across all ranks in a balancing group and then solving a global knapsack problem. The solver aims to minimize the variance of the total workload per GPU, while accounting for the effect of sequence parallelism. By integrating DeepSpeed-Ulysses-based sequence parallelism into the load-balancing decision process and utilizing a simple semi-empirical workload model, KnapFormer achieves minimal communication overhead and less than 1% workload discrepancy in real-world training workloads with sequence lengths varying from a few hundred to tens of thousands. It eliminates straggler effects and achieves 2x to 3x speedup when training state-of-the-art diffusion models like FLUX on mixed-resolution and image-video joint data corpora. We open-source the KnapFormer implementation at https://github.com/Kai-46/KnapFormer/
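
The abstract describes gathering per-rank sequence-length metadata and solving a global knapsack problem that minimizes per-GPU workload variance. As a hedged illustration of the idea, not the repo's actual solver, the sketch below uses a greedy longest-processing-time heuristic: score each sequence with a cost model like the one above, then repeatedly place the most expensive remaining sequence on the least-loaded GPU.

```python
import heapq

def balance_sequences(seq_lens, num_gpus, alpha=1e-6, beta=1e-3):
    """Greedy stand-in for the global knapsack solver: assign sequences
    to GPUs so that the estimated per-GPU workload is near-uniform."""
    cost = lambda n: alpha * n * n + beta * n
    # Min-heap of (current load, gpu_id); the least-loaded GPU is always on top.
    heap = [(0.0, g) for g in range(num_gpus)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_gpus)]
    # Place expensive sequences first (longest-processing-time order).
    for i in sorted(range(len(seq_lens)), key=lambda i: -cost(seq_lens[i])):
        load, g = heapq.heappop(heap)
        assignment[g].append(i)
        heapq.heappush(heap, (load + cost(seq_lens[i]), g))
    return assignment

# Mixed-resolution batch: lengths span hundreds to tens of thousands of tokens.
print(balance_sequences([256, 512, 4096, 8192, 30000, 1024, 2048, 16000], num_gpus=4))
```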
Problem

Research questions and friction points this paper is trying to address.

Addresses token imbalance in distributed DiT training
Minimizes per-GPU workload variance via knapsack optimization
Eliminates stragglers for 2-3x faster mixed-resolution training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Global knapsack solver redistributes tokens (see the sketch after this list)
Integrates sequence parallelism into load balancing
Semi-empirical workload model minimizes communication overhead
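
Once a balanced assignment is computed, tokens have to physically move between ranks. Below is a minimal sketch of that step, assuming an initialized PyTorch distributed process group and a hypothetical solver-produced plan of per-rank send/receive counts; it uses `torch.distributed.all_to_all_single`, the same collective family that DeepSpeed-Ulysses sequence parallelism is built on, and is not KnapFormer's actual implementation.

```python
import torch
import torch.distributed as dist

def redistribute_tokens(local_tokens: torch.Tensor,
                        send_splits: list[int],
                        recv_splits: list[int]) -> torch.Tensor:
    """Move token shards between ranks in a single all-to-all.

    `send_splits[r]` is how many of this rank's tokens go to rank r and
    `recv_splits[r]` how many arrive from rank r; both lists would come
    from the load-balancing solver. `local_tokens` must already be
    ordered so tokens destined for the same rank are contiguous.
    """
    out = local_tokens.new_empty(sum(recv_splits), *local_tokens.shape[1:])
    dist.all_to_all_single(out, local_tokens,
                           output_split_sizes=recv_splits,
                           input_split_sizes=send_splits)
    return out
```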