🤖 AI Summary
This work addresses the high latency and poor compute-communication overlap of Diffusion Transformer (DiT) inference under Ulysses sequence parallelism, both caused primarily by frequent all-to-all communication. The authors propose three mechanisms: Tile-Aware Parallel All-to-all (TAPA), a V-First computation-communication scheduling strategy, and selective V-Major communication that exploits tensor redundancy across denoising steps. Combined with pipelining that accounts for the RoPE and normalization steps applied to Q/K, these mechanisms hide part of the communication overhead and reduce the volume transmitted. Evaluated on the Aurora supercomputer across 1–8 nodes (up to 96 Intel GPU tiles) on four DiT models, the approach achieves an average speedup of 3.6×, with a peak of 8.4×.
📝 Abstract
Diffusion Transformers (DiTs) are increasingly adopted in scientific computing, yet growing model sizes and resolutions make distributed multi-GPU inference essential. Ulysses sequence parallelism scales DiT inference but introduces frequent all-to-all collectives that dominate latency. Overlapping these with computation is difficult due to tight data dependencies, large message volumes, and asymmetric interconnect bandwidths.
We introduce CoCoDiff, a distributed DiT inference engine that exploits two observations: (1) the attention value tensor V requires only a linear projection, while the query and key tensors Q/K need additional normalization and RoPE, which creates an opportunity to overlap V's communication with Q/K computation; (2) adjacent denoising steps produce similar tensors, yielding temporal redundancy. CoCoDiff introduces three mechanisms: Tile-Aware Parallel All-to-all (TAPA) decomposes collectives into topology-aligned phases; V-First scheduling hides V's communication behind Q/K computation; and V-Major selective communication transmits only active projections on slow interconnects. On the Aurora supercomputer with four DiT models across 1–8 nodes (up to 96 Intel GPU tiles), CoCoDiff achieves an average speedup of 3.6×, peaking at 8.4×.
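The intuition behind V-First scheduling can be illustrated with a toy timeline model. The sketch below is not CoCoDiff's implementation: the function names, the two-stream model (one serial compute stream, one serial communication link), and every cost value are assumptions chosen only to show why issuing the cheapest projection first can shorten the critical path.

```python
# Toy timeline model of compute/communication overlap.
# All costs are made-up units, not measurements from the paper.

def blocking_makespan(tasks):
    """Finish time when all compute runs first and all all-to-all
    traffic is sent afterwards, with no overlap."""
    total_compute = sum(compute for _name, compute, _comm in tasks)
    total_comm = sum(comm for _name, _compute, comm in tasks)
    return total_compute + total_comm


def overlapped_makespan(tasks):
    """Finish time when each tensor's all-to-all is issued as soon as
    that tensor's compute finishes and the link is free.

    `tasks` is a list of (name, compute_cost, comm_cost) in issue order.
    Compute and communication each run serially on their own stream.
    """
    compute_done = 0  # time at which the compute stream becomes idle
    link_free = 0     # time at which the communication link becomes idle
    for _name, compute_cost, comm_cost in tasks:
        compute_done += compute_cost
        link_free = max(link_free, compute_done) + comm_cost
    return link_free


# Hypothetical costs: Q and K pay for projection + normalization + RoPE,
# V pays for projection only; message sizes are assumed equal.
Q = ("Q", 3, 2)
K = ("K", 3, 2)
V = ("V", 1, 2)

print(blocking_makespan([Q, K, V]))    # 13
print(overlapped_makespan([Q, K, V]))  # 10
print(overlapped_makespan([V, Q, K]))  # 9
```

Under these assumed costs, starting with V lets its transfer begin after one unit of compute instead of seven, so the link is busy while Q and K are still being normalized and rotated. The model ignores link asymmetry and message splitting, which are the concerns TAPA and V-Major communication address.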