Swin DiT: Diffusion Transformer using Pseudo Shifted Windows

📅 2025-05-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high computational cost, redundant global attention, and low-frequency inertia of conventional attention in diffusion transformers (DiTs) for high-resolution image generation, this paper proposes Pseudo Shifted Window Attention (PSWA) and Progressive Coverage Channel Allocation (PCCA). PSWA combines localized window attention with a pseudo-shift mechanism to preserve a global receptive field while substantially reducing redundant computation and enhancing high-frequency detail modeling. PCCA dynamically allocates channels based on inter-channel attention similarity, without introducing extra parameters or computational overhead, to improve higher-order feature collaboration. Built into the Swin-DiT-L architecture, these components yield a 54% improvement in Fréchet Inception Distance (FID) over DiT-XL/2 while simultaneously reducing GPU memory consumption and inference latency, demonstrating a favorable trade-off between generation quality and computational efficiency.
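The paper's exact PSWA implementation isn't reproduced on this page, so the following is a minimal PyTorch sketch of the idea as summarized above: plain non-overlapping window attention, plus a lightweight branch standing in for the high-frequency bridging branch that replaces shifted windows. The class name `PseudoShiftedWindowAttention` and the choice of a 3×3 depthwise convolution for the bridge are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PseudoShiftedWindowAttention(nn.Module):
    """Sketch of PSWA: window attention + a high-frequency bridging branch."""

    def __init__(self, dim: int, num_heads: int = 8, window_size: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.window_size = window_size
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Assumed form of the bridging branch: a 3x3 depthwise conv that
        # carries local high-frequency detail across window borders.
        self.bridge = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) latent feature map; H and W divisible by window_size
        B, C, H, W = x.shape
        ws = self.window_size
        # Partition into non-overlapping windows: (B * nWindows, ws*ws, C).
        t = x.view(B, C, H // ws, ws, W // ws, ws)
        t = t.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, C)
        n = t.shape[0]
        qkv = self.qkv(t).reshape(n, ws * ws, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)         # each: (n, heads, ws*ws, head_dim)
        t = F.scaled_dot_product_attention(q, k, v)  # attention stays inside each window
        t = t.transpose(1, 2).reshape(n, ws * ws, C)
        t = self.proj(t)
        # Undo the window partition back to (B, C, H, W).
        t = t.view(B, H // ws, W // ws, ws, ws, C)
        t = t.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        # The bridge lets information cross window borders, standing in for
        # the shifted-window pass of a standard Swin block.
        return t + self.bridge(x)
```

With this structure, attention cost scales with the window area rather than the full H·W token count, which is the source of the efficiency gain over global attention, while the bridging branch supplies the cross-window and high-frequency information that shifted windows would otherwise provide.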

📝 Abstract
Diffusion Transformers (DiTs) achieve remarkable performance in image generation by incorporating the transformer architecture. Conventionally, DiTs are constructed by stacking serial, isotropic, global information-modeling transformer blocks, which incur significant computational cost when processing high-resolution images. We empirically observe that latent-space image generation does not depend on global information as strongly as traditionally assumed: most layers in the model exhibit redundancy in global computation. In addition, conventional attention mechanisms suffer from low-frequency inertia. To address these issues, we propose **P**seudo **S**hifted **W**indow **A**ttention (PSWA), which fundamentally mitigates global-modeling redundancy. PSWA achieves intermediate global-local information interaction through window attention, while employing a high-frequency bridging branch to simulate shifted-window operations, supplementing appropriate global and high-frequency information. Furthermore, we propose the Progressive Coverage Channel Allocation (PCCA) strategy, which captures high-order attention similarity without additional computational cost. Building upon all of these, we propose a series of Pseudo **S**hifted **Win**dow DiTs (**Swin DiT**), accompanied by extensive experiments demonstrating their superior performance. For example, our proposed Swin-DiT-L achieves a 54%↑ FID improvement over DiT-XL/2 while requiring less computation. https://github.com/wujiafu007/Swin-DiT
Problem

Research questions and friction points this paper is trying to address.

Reduces computational cost in high-resolution image generation
Addresses redundancy in global computation of Diffusion Transformers
Mitigates low-frequency inertia in conventional attention mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pseudo Shifted Window Attention for global-local interaction
High-frequency bridging branch simulates shifted windows
Progressive Coverage Channel Allocation for higher-order attention similarity (see the sketch below)
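The page describes PCCA only at this level of detail, so the following is a speculative, parameter-free sketch of what a "progressive coverage" channel schedule could look like: each block routes a different, progressively shifted slice of channels to an auxiliary branch, so that the slices jointly cover every channel across the depth of the network. The function name `pcca_slice` and the allocation rule are assumptions, not the paper's algorithm; only the "no extra parameters or computation" property is taken from the summary above.

```python
import torch

def pcca_slice(channels: int, layer_idx: int, num_layers: int,
               ratio: float = 0.25) -> slice:
    """Pick the channel slice a given layer hands to its auxiliary branch.

    Hypothetical rule: slide a fixed-size window along the channel axis as
    depth increases, so the union of all layers' slices covers every channel.
    Parameter-free, matching the paper's no-extra-cost claim.
    """
    k = max(1, int(channels * ratio))               # slice width per layer
    stride = max(1, (channels - k) // max(1, num_layers - 1))
    start = min(layer_idx * stride, channels - k)   # clamp to a valid range
    return slice(start, start + k)

# Usage: split a (B, C, H, W) feature map at layer 3 of a 12-layer stack.
x = torch.randn(2, 256, 16, 16)
sel = pcca_slice(channels=256, layer_idx=3, num_layers=12)
aux = x[:, sel]                                     # channels for the auxiliary branch
print(aux.shape)                                    # torch.Size([2, 64, 16, 16])
```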
👥 Authors
Jiafu Wu (Tencent YouTu Lab)
Yabiao Wang (Tencent YouTu Lab)
Jian Li (Tencent YouTu Lab)
Jinlong Peng (Tencent YouTu Lab)
Yun Cao (Tencent YouTu Lab)
Chengjie Wang (Tencent YouTu Lab)
Jiangning Zhang (Tencent YouTu Lab)