🤖 AI Summary
Convolutional locality limits CNN-based diffusion models in modeling long-range semantics and in achieving precise text–image alignment. To address this, the paper proposes the first end-to-end text-guided diffusion model built on the Swin Transformer, integrating Swin Transformer blocks throughout both the encoder and decoder of the diffusion architecture. Key contributions: (1) the first complete incorporation of the Swin Transformer into the encoder–decoder stacks of a diffusion model; (2) an adaptive timestep search strategy, coupled with a dynamic timestep scheduling mechanism, that improves denoising efficiency; and (3) feature-level fusion of CLIP text embeddings, enabling single-stage conditional generation without auxiliary models. On ImageNet, the model achieves a state-of-the-art FID of 1.37 while accelerating inference by 10%. In qualitative evaluation, human raters found the generated images difficult to distinguish from real ones.
📝 Abstract
Diffusion models have shown remarkable capacity in image synthesis, building on a U-shaped architecture with convolutional neural network (CNN) blocks as their basic components. However, the locality of the convolution operation may limit a CNN's ability to capture long-range semantic information. To address this issue, we propose Yuan-TecSwin, a text-conditioned diffusion model built on the Swin Transformer. Swin Transformer blocks replace the CNN blocks in both the encoder and decoder, improving non-local modeling in feature extraction and image restoration. Text–image alignment is improved through a well-chosen text encoder, effective use of text embeddings, and careful design of the text-conditioning mechanism. By adaptively searching for time steps across different diffusion stages, inference speed is further improved by 10%. Yuan-TecSwin achieves a state-of-the-art FID score of 1.37 on the ImageNet generation benchmark, without requiring additional models at different denoising stages. In a side-by-side comparison, human interviewees found it difficult to tell the model-generated images from human-painted ones.
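The abstract does not detail the adaptive time-step search. As a rough illustration of the underlying idea of non-uniform step selection across diffusion stages, the sketch below (a hypothetical `select_timesteps` helper, not the paper's actual search procedure) spends more of a fixed step budget on the low-noise, detail-refining stage:

```python
def select_timesteps(total_steps: int, budget: int, power: float = 2.0) -> list[int]:
    """Pick `budget` timesteps out of `total_steps`, spaced non-uniformly.

    With power > 1, steps are concentrated near t = 0 (the low-noise,
    detail-refining stage). This is an illustrative heuristic standing in
    for the paper's adaptive search, not the method itself.
    """
    chosen = {round((i / (budget - 1)) ** power * (total_steps - 1))
              for i in range(budget)}
    return sorted(chosen)

# Example: a 25-step schedule over a 1000-step diffusion process.
schedule = select_timesteps(1000, 25)
```

With `power=2.0` the early gaps (near t = 0) are only a few steps wide while the late gaps span dozens of steps, which is the kind of stage-dependent spacing an adaptive search could discover.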