🤖 AI Summary
Diffusion Transformers (DiTs) suffer from training costs that scale quadratically with sequence length, hindering large-scale pretraining. Existing token-pruning methods either degrade representation quality at high sparsity or incur parameter redundancy and generalize poorly. This paper proposes SPRINT, a residual-fusion architecture that keeps shallow layers dense and deep layers sparse, coupled with two-stage training (masked pretraining followed by full-token fine-tuning) and Path-Drop Guidance (PDG) at inference, enabling dynamic path selection that bridges the train-inference discrepancy while preserving fine-grained detail. On ImageNet-1K, SPRINT achieves a 9.8× training speedup while matching the FID/FDD of full-token baselines; at inference it reduces FLOPs by 47% and even surpasses baseline generation quality. Key innovations include residual sparse-dense feature co-modeling and a train-inference-consistent dynamic path-optimization mechanism.
📝 Abstract
Diffusion Transformers (DiTs) deliver state-of-the-art generative performance, but their training cost, which grows quadratically with sequence length, makes large-scale pretraining prohibitively expensive. Token dropping can reduce training cost, yet naïve strategies degrade representations, and existing methods are either parameter-heavy or fail at high drop ratios. We present SPRINT (Sparse-Dense Residual Fusion for Efficient Diffusion Transformers), a simple method that enables aggressive token dropping (up to 75%) while preserving quality. SPRINT exploits the complementary roles of shallow and deep layers: early layers process all tokens to capture local detail, deeper layers operate on a sparse subset to cut computation, and their outputs are fused through residual connections. Training follows a two-stage schedule: long masked pretraining for efficiency, then short full-token fine-tuning to close the train-inference gap. On ImageNet-1K 256×256, SPRINT achieves 9.8× training savings with comparable FID/FDD, and at inference its Path-Drop Guidance (PDG) nearly halves FLOPs while improving quality. These results establish SPRINT as a simple, effective, and general recipe for efficient DiT training.
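The shallow-dense / deep-sparse split described above can be sketched in a few lines. The snippet below is a minimal, hypothetical simplification (NumPy in place of real transformer blocks; `mlp`, `sprint_forward`, and the random subset selection are illustrative assumptions, not the paper's actual implementation): a dense shallow pass over all tokens, a deep pass over a 25% subset (i.e. 75% of tokens dropped), and a residual scatter-add that fuses the two streams.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w):
    # Stand-in for a transformer block: one linear map + tanh-approx GELU.
    h = x @ w
    return h * 0.5 * (1.0 + np.tanh(0.7978845608 * (h + 0.044715 * h**3)))

def sprint_forward(tokens, w_shallow, w_deep, keep_ratio=0.25):
    """Hypothetical sketch of sparse-dense residual fusion.

    Shallow path sees all tokens; deep path sees only a random
    keep_ratio subset; outputs are fused by a residual add.
    """
    n = tokens.shape[0]
    shallow = mlp(tokens, w_shallow)           # dense: all n tokens

    keep = rng.choice(n, size=max(1, int(n * keep_ratio)), replace=False)
    deep_sparse = mlp(shallow[keep], w_deep)   # sparse: kept subset only

    fused = shallow.copy()                     # residual fusion: scatter the
    fused[keep] += deep_sparse                 # deep outputs back into the
    return fused                               # dense shallow stream

# Toy usage: 16 tokens of width 8, deep path runs on only 4 of them.
d = 8
tokens = rng.standard_normal((16, d))
w1 = rng.standard_normal((d, d)) * 0.1
w2 = rng.standard_normal((d, d)) * 0.1
out = sprint_forward(tokens, w1, w2, keep_ratio=0.25)
print(out.shape)  # dense-shaped output despite the 75% drop in the deep path
```

Note the design point this illustrates: because the fusion is residual, tokens the deep path never saw still carry their shallow features, which is what lets aggressive dropping preserve fine-grained detail.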