🤖 AI Summary
On PCIe-only multi-GPU nodes, layerwise offloading cannot hide prefetch latency when the per-GPU compute workload is small, and prefetch also contends with collective communication (all-reduce, all-to-all) for the shared PCIe path. This work recasts offloading as a co-scheduling problem between prefetch and communication and introduces ChunkFlow, a runtime guided by a first-order analytical model that predicts when computation can hide prefetch. ChunkFlow prefetches at chunk granularity and adaptively yields to collective communication, smoothly trading GPU memory for prefetch volume. On three diffusion transformers running on two H100 GPUs over PCIe with Ulysses sequence parallelism, ChunkFlow achieves up to 1.28× step-time speedup over SGLang's existing layerwise offloading, cuts peak GPU memory by up to 49% relative to the no-offload baseline at near-identical step time once the workload is large enough, and exposes a tunable memory-latency tradeoff that recovers near-zero step-time overhead for small workloads.
📝 Abstract
Layerwise offloading reduces the GPU memory footprint of large diffusion transformer (DiT) inference by prefetching upcoming layers from host memory, but its effectiveness hinges on hiding prefetch latency behind per-layer computation. This assumption breaks down when the per-GPU compute workload is small. Moreover, on PCIe-only nodes, prefetch and inter-GPU collective communications such as all-reduce and all-to-all contend on the shared PCIe path, exposing prefetch latency even when compute would otherwise hide it. We revisit layerwise offloading as a co-scheduling problem between prefetch and communication, guided by a first-order analytical model that predicts when prefetch can be hidden by computation. Building on this model, we design ChunkFlow, a communication-aware, chunk-granular offloading runtime that adaptively yields to collective communication and smoothly trades GPU memory for prefetch volume. On three representative diffusion transformers running on two H100 GPUs over PCIe with Ulysses sequence parallelism, ChunkFlow delivers up to 1.28× step-time speedup over SGLang's existing layerwise offloading, reduces peak GPU memory by up to 49% relative to the no-offload baseline at near-identical step time once the workload is large enough, and exposes a tunable memory-latency tradeoff that recovers near-zero step-time overhead in the small-workload regime.
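The abstract does not spell out the analytical model, so the following is only an illustrative sketch of what a first-order check of this kind might look like, not ChunkFlow's actual formulation. It assumes prefetch time is simply bytes divided by PCIe bandwidth, that prefetch yields the link whenever a collective is running, and that keeping a fraction of each layer resident on the GPU reduces the prefetch volume proportionally. All names (`LayerProfile`, `exposed_prefetch_s`, `min_resident_fraction`) and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LayerProfile:
    """Hypothetical per-layer measurements (not taken from the paper)."""
    weight_bytes: float    # size of one layer's weights in bytes
    compute_s: float       # per-layer compute time on one GPU, in seconds
    comm_overlap_s: float  # part of compute_s during which collectives hold the PCIe path


def free_window_s(layer: LayerProfile) -> float:
    # Assumption: prefetch yields to collectives, so it can only use the
    # PCIe path during the part of the compute window that is free of them.
    return max(0.0, layer.compute_s - layer.comm_overlap_s)


def exposed_prefetch_s(layer: LayerProfile, pcie_bytes_per_s: float,
                       resident_fraction: float = 0.0) -> float:
    """Prefetch time that computation cannot hide (0.0 means fully hidden)."""
    prefetch_bytes = layer.weight_bytes * (1.0 - resident_fraction)
    prefetch_s = prefetch_bytes / pcie_bytes_per_s
    return max(0.0, prefetch_s - free_window_s(layer))


def min_resident_fraction(layer: LayerProfile, pcie_bytes_per_s: float) -> float:
    """Smallest fraction of the layer to keep on the GPU so prefetch is hidden."""
    if layer.weight_bytes <= 0:
        return 0.0
    hideable_bytes = free_window_s(layer) * pcie_bytes_per_s
    return min(1.0, max(0.0, 1.0 - hideable_bytes / layer.weight_bytes))


if __name__ == "__main__":
    bw = 25e9  # illustrative PCIe bandwidth in bytes/s, not a measured value
    large = LayerProfile(weight_bytes=1e9, compute_s=0.05, comm_overlap_s=0.0)
    small = LayerProfile(weight_bytes=1e9, compute_s=0.02, comm_overlap_s=0.0)
    print(exposed_prefetch_s(large, bw))     # large workload: prefetch is hidden
    print(exposed_prefetch_s(small, bw))     # small workload: part of prefetch is exposed
    print(min_resident_fraction(small, bw))  # memory to trade back to hide it
```

Under these assumptions, the large-workload layer hides its prefetch entirely, while the small-workload layer must keep part of its weights resident to avoid exposed latency, which mirrors the memory-latency tradeoff the abstract describes.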