Sprint: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion Transformers (DiTs) suffer from training cost that scales quadratically with sequence length, hindering large-scale pretraining. Existing token pruning methods either degrade representation quality at high sparsity or incur parameter redundancy and generalize poorly. This paper proposes SPRINT, a residual fusion architecture that pairs dense shallow layers with sparse deep layers, coupled with two-stage training (masked pretraining followed by full-token fine-tuning) and Path-Drop Guidance (PDG) at inference, enabling dynamic path selection that bridges the train-inference discrepancy while preserving fine-grained detail. On ImageNet-1K, SPRINT achieves a 9.8× training speedup while matching the FID/FDD of full-token baselines; it reduces inference FLOPs by 47% and even surpasses baseline generation quality. Key innovations include residual sparse-dense feature co-modeling and a train-inference-consistent dynamic path optimization mechanism.

📝 Abstract
Diffusion Transformers (DiTs) deliver state-of-the-art generative performance but their quadratic training cost with sequence length makes large-scale pretraining prohibitively expensive. Token dropping can reduce training cost, yet naïve strategies degrade representations, and existing methods are either parameter-heavy or fail at high drop ratios. We present SPRINT, Sparse-Dense Residual Fusion for Efficient Diffusion Transformers, a simple method that enables aggressive token dropping (up to 75%) while preserving quality. SPRINT leverages the complementary roles of shallow and deep layers: early layers process all tokens to capture local detail, deeper layers operate on a sparse subset to cut computation, and their outputs are fused through residual connections. Training follows a two-stage schedule: long masked pre-training for efficiency followed by short full-token fine-tuning to close the train-inference gap. On ImageNet-1K 256×256, SPRINT achieves 9.8× training savings with comparable FID/FDD, and at inference, its Path-Drop Guidance (PDG) nearly halves FLOPs while improving quality. These results establish SPRINT as a simple, effective, and general solution for efficient DiT training.
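The forward pass the abstract describes (dense shallow layers, a sparse deep path, residual fusion of the two) can be sketched in miniature. This is a toy illustration, not the paper's implementation: tokens are plain float lists, the "layers" are stand-in affine maps, the random token selection and all names (`sprint_forward`, `keep_ratio`) are assumptions for illustration.

```python
import random

def layer(tokens, scale, bias):
    """Stand-in for a transformer block: a per-token affine map."""
    return [[scale * v + bias for v in tok] for tok in tokens]

def sprint_forward(tokens, keep_ratio=0.25, seed=0):
    # 1) Shallow layers process ALL tokens (capture local detail).
    shallow = layer(tokens, scale=1.1, bias=0.01)

    # 2) Keep only a sparse subset of tokens for the deep path
    #    (keep_ratio=0.25 mirrors the paper's 75% drop setting).
    n_keep = max(1, int(len(shallow) * keep_ratio))
    kept = sorted(random.Random(seed).sample(range(len(shallow)), n_keep))

    # 3) Deep layers run only on the kept tokens, cutting the
    #    quadratic attention cost on the expensive deep blocks.
    deep_sparse = layer([shallow[i] for i in kept], scale=0.9, bias=0.0)

    # 4) Residual fusion: scatter deep outputs back onto the dense
    #    shallow features, so dropped tokens retain their detail.
    fused = [tok[:] for tok in shallow]
    for j, i in enumerate(kept):
        fused[i] = [s + d for s, d in zip(shallow[i], deep_sparse[j])]
    return fused, kept
```

Dropped tokens pass through with their shallow (dense) features intact, while kept tokens receive the deep update as a residual, which is the mechanism the abstract credits for preserving quality at aggressive drop ratios.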
Problem

Research questions and friction points this paper is trying to address.

Reduces quadratic training cost of Diffusion Transformers
Enables aggressive token dropping while preserving quality
Fuses sparse and dense representations through residual connections
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse-dense residual fusion for efficient transformers
Two-stage training with masked pre-training and fine-tuning
Path-Drop Guidance reduces inference FLOPs while improving quality
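The summary does not spell out the Path-Drop Guidance formula, so the following is a minimal sketch under one assumption: that PDG combines a full-path prediction with a cheaper token-dropped prediction via a classifier-free-guidance-style linear extrapolation. The function name and `weight` parameter are illustrative, not from the paper.

```python
def path_drop_guidance(pred_full, pred_drop, weight=1.5):
    """Hypothetical PDG step (assumed CFG-style form): extrapolate
    from the cheap dropped-path prediction toward the full-path
    prediction. weight=1 recovers the full-path output; weight>1
    pushes past it. Names and formula are illustrative."""
    return [d + weight * (f - d) for f, d in zip(pred_full, pred_drop)]
```

Under this reading, the FLOP savings would come from replacing one of the two guidance passes with the much cheaper sparse path, consistent with the reported near-halving of inference FLOPs.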