🤖 AI Summary
This work addresses the long-standing trade-off between efficiency and generation quality in text-to-image (T2I) synthesis. We propose Switti, a scale-wise transformer for T2I generation. Methodologically, Switti adapts a next-scale prediction autoregressive (AR) architecture to T2I while resolving training-stability issues; drops the causal attention constraint, since scale-wise prediction does not require causality, yielding ~21% faster sampling and lower memory usage with slightly better generation quality; and disables classifier-free guidance at high-resolution scales, where it is often unnecessary and can even hurt fine-grained details, for a further ~32% sampling acceleration. Human preference studies and automated evaluations show that Switti outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being up to 7× faster.
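The causal-vs-non-causal distinction can be pictured with a toy attention mask. VAR-style scale-wise AR models use a block-causal mask (full attention within a scale, causal across scales); the non-causal counterpart removes that constraint, shown here simply as an unmasked variant. The helper below is an illustrative sketch, not Switti's actual attention implementation:

```python
def scale_mask(scale_sizes, causal=True):
    """Boolean attention mask over the concatenated tokens of all scales.

    causal=True  -> block-causal: tokens of scale i attend to scales 0..i
                    (full attention within a scale, causal across scales).
    causal=False -> no mask: a toy stand-in for the non-causal variant,
                    which needs no causal masking (or KV cache) at sampling.
    scale_sizes and the mask representation are illustrative assumptions.
    """
    n = sum(scale_sizes)
    if not causal:
        return [[True] * n for _ in range(n)]
    # Scale index owning each token position.
    owner = [i for i, s in enumerate(scale_sizes) for _ in range(s)]
    # Query at position q may attend to key j iff j's scale is not later.
    return [[owner[j] <= owner[q] for j in range(n)] for q in range(n)]
```

For example, with scales of 1 and 2 tokens, the first token's causal row is `[True, False, False]`, while the non-causal mask is all `True`.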
📝 Abstract
This work presents Switti, a scale-wise transformer for text-to-image generation. We start by adapting an existing next-scale prediction autoregressive (AR) architecture to T2I generation, investigating and mitigating training stability issues in the process. Next, we argue that scale-wise transformers do not require causality and propose a non-causal counterpart facilitating ~21% faster sampling and lower memory usage while also achieving slightly better generation quality. Furthermore, we reveal that classifier-free guidance at high-resolution scales is often unnecessary and can even degrade performance. By disabling guidance at these scales, we achieve an additional sampling acceleration of ~32% and improve the generation of fine-grained details. Extensive human preference studies and automated evaluations show that Switti outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being up to 7x faster.
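The guidance switch described above can be sketched as a per-scale decision: run standard classifier-free guidance (two forward passes) at early, low-resolution scales and a single conditional pass once a cutoff scale is reached. The cutoff fraction `disable_frac` and the guidance weight below are illustrative assumptions, not values from the paper:

```python
def guided_logit(model, x, cond, uncond, scale_idx, num_scales,
                 guidance=4.0, disable_frac=0.6):
    """Classifier-free guidance, skipped at late high-resolution scales.

    `disable_frac` is a hypothetical cutoff: once scale_idx/num_scales
    reaches it, guidance is disabled and only the conditional pass runs.
    """
    if scale_idx / num_scales >= disable_frac:
        # High-resolution scale: single conditional pass, roughly 2x cheaper.
        return model(x, cond)
    # Low-resolution scale: standard CFG, extrapolating away from uncond.
    lc = model(x, cond)
    lu = model(x, uncond)
    return lu + guidance * (lc - lu)
```

Here `model(x, c)` stands in for one transformer forward pass returning a logit; in practice the extrapolation is applied elementwise to the full logit tensor.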