🤖 AI Summary
To address the high computational cost, redundant global attention, and low-frequency inertia (weak high-frequency modeling) in diffusion transformers (DiTs) for high-resolution image generation, this paper proposes Pseudo Shifted Window Attention (PSWA) and Progressive Coverage Channel Allocation (PCCA). PSWA combines localized windowing with a pseudo-shift mechanism to preserve global receptive fields while substantially reducing redundant computation and enhancing high-frequency detail modeling. PCCA dynamically allocates channels based on inter-channel attention similarity, without introducing extra parameters or computational overhead, to improve higher-order feature collaboration. Built on these components, the Swin-DiT-L model achieves a 54% improvement in Fréchet Inception Distance (FID) over DiT-XL/2 while reducing GPU memory consumption and inference latency, demonstrating a favorable trade-off between generation quality and computational efficiency.
📝 Abstract
Diffusion Transformers (DiTs) achieve remarkable performance in image generation through the incorporation of the transformer architecture. Conventionally, DiTs are constructed by stacking serial isotropic transformers that model global information, which incurs significant computational cost when processing high-resolution images. Our empirical analysis shows that latent-space image generation does not depend on global information as strongly as traditionally assumed: most layers in the model exhibit redundancy in global computation. In addition, conventional attention mechanisms suffer from low-frequency inertia. To address these issues, we propose **P**seudo **S**hifted **W**indow **A**ttention (PSWA), which fundamentally mitigates global modeling redundancy. PSWA achieves intermediate global-local information interaction through window attention, while employing a high-frequency bridging branch to simulate shifted window operations, supplementing appropriate global and high-frequency information. Furthermore, we propose the **P**rogressive **C**overage **C**hannel **A**llocation (PCCA) strategy, which captures high-order attention similarity without additional computational cost. Building upon all of these, we propose a series of Pseudo **S**hifted **Win**dow DiTs (**Swin DiT**), accompanied by extensive experiments demonstrating their superior performance. For example, our proposed Swin-DiT-L achieves a 54%↑ FID improvement over DiT-XL/2 while requiring less computational cost. https://github.com/wujiafu007/Swin-DiT
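The core PSWA idea described above can be sketched as non-overlapping window self-attention plus a parallel high-frequency branch. The sketch below is a minimal, single-head NumPy illustration under stated assumptions: the window partition follows the standard Swin-style reshape, and the high-frequency branch is approximated here by a Laplacian-style residual (input minus a local box blur). The exact branch design, projections, and the pseudo-shift details are the paper's; this is not the released implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(x, window=4):
    """Self-attention restricted to non-overlapping windows.

    x: (H, W, C) feature map; H and W must be divisible by `window`.
    Q = K = V = x within each window (single head, no learned
    projections -- illustration only).
    """
    H, W, C = x.shape
    # Partition into (num_windows, window*window, C) token groups.
    xw = x.reshape(H // window, window, W // window, window, C)
    xw = xw.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, C)
    attn = softmax(xw @ xw.transpose(0, 2, 1) / np.sqrt(C))
    out = attn @ xw
    # Reverse the window partition back to (H, W, C).
    out = out.reshape(H // window, W // window, window, window, C)
    return out.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def high_freq_branch(x):
    """Hypothetical high-frequency bridge: the input minus a 3x3 box
    blur, i.e. a crude high-pass residual (not the paper's exact op)."""
    H, W, _ = x.shape
    pad = np.pad(x, ((1, 1), (1, 1), (0, 0)), mode="edge")
    blur = sum(pad[i:i + H, j:j + W]
               for i in range(3) for j in range(3)) / 9.0
    return x - blur

def pswa_block(x, window=4):
    """PSWA-style block: local window attention supplemented by a
    high-frequency branch in place of an explicit shifted window."""
    return window_attention(x, window) + high_freq_branch(x)

x = np.random.default_rng(0).normal(size=(8, 8, 16))
y = pswa_block(x)
print(y.shape)  # (8, 8, 16)
```

Because attention is computed per window, cost scales linearly with the number of windows rather than quadratically with the full token count, which is the source of the efficiency gain the abstract claims.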