Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis

📅 2024-12-02
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the trade-off between efficiency and generation quality in text-to-image (T2I) synthesis. The authors propose Switti, a scale-wise non-causal Transformer. Methodologically, it adapts an existing next-scale prediction autoregressive (AR) architecture to T2I generation while mitigating training instabilities; removes the causal constraint across scales, yielding ~21% faster sampling and lower memory usage with slightly better generation quality; and disables classifier-free guidance at high-resolution scales, which adds ~32% sampling acceleration and improves fine-grained detail. In human preference studies and automated evaluations, Switti outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being up to 7× faster.

📝 Abstract
This work presents Switti, a scale-wise transformer for text-to-image generation. We start by adapting an existing next-scale prediction autoregressive (AR) architecture to T2I generation, investigating and mitigating training stability issues in the process. Next, we argue that scale-wise transformers do not require causality and propose a non-causal counterpart facilitating ~21% faster sampling and lower memory usage while also achieving slightly better generation quality. Furthermore, we reveal that classifier-free guidance at high-resolution scales is often unnecessary and can even degrade performance. By disabling guidance at these scales, we achieve an additional sampling acceleration of ~32% and improve the generation of fine-grained details. Extensive human preference studies and automated evaluations show that Switti outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being up to 7x faster.
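The guidance trick in the abstract can be made concrete with a minimal sketch. Classifier-free guidance (CFG) extrapolates the conditional prediction away from an unconditional one, at the cost of a second forward pass per step; the paper's observation is that this extra pass is unnecessary (and can hurt) at high-resolution scales. The sketch below is illustrative only: `predict`, `cfg_cutoff`, and the loop structure are hypothetical stand-ins, not Switti's actual API.

```python
import numpy as np

def cfg(cond_logits, uncond_logits, w):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one with guidance weight w."""
    return uncond_logits + w * (cond_logits - uncond_logits)

def scale_wise_sample(predict, scales, w=6.0, cfg_cutoff=0.5):
    """Hypothetical sketch of scale-wise sampling where CFG is applied
    only at the first `cfg_cutoff` fraction of scales; later
    (high-resolution) scales skip the extra unconditional forward pass,
    which is the source of the ~32% sampling speedup the paper reports."""
    outputs = []
    n = len(scales)
    for i, res in enumerate(scales):
        cond = predict(res, conditional=True)
        if i < cfg_cutoff * n:
            # low-resolution scale: run the second, unconditional pass
            uncond = predict(res, conditional=False)
            logits = cfg(cond, uncond, w)
        else:
            # high-resolution scale: guidance disabled, one pass only
            logits = cond
        outputs.append(logits)
    return outputs
```

The cutoff fraction and weight here are placeholders; the paper determines empirically at which scales guidance stops helping.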
Problem

Research questions and friction points this paper is trying to address.

How can scale-wise transformers improve the efficiency–quality trade-off in text-to-image generation?
How can training instability and slow sampling in next-scale prediction T2I models be mitigated?
At which scales is classifier-free guidance actually needed, and how does it affect fine-grained detail?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Non-causal transformer for faster sampling
Disabling high-resolution classifier-free guidance
Improved fine-grained detail generation
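The first innovation rests on the attention structure of next-scale prediction models: all tokens of one scale are generated in parallel, so causality is only ever enforced *across* scales, and the paper argues even that constraint can be dropped. A minimal illustrative mask construction (an assumption about the general block-causal scheme, not Switti's exact implementation):

```python
import numpy as np

def block_causal_mask(scale_sizes):
    """Illustrative block-causal attention mask for next-scale AR models:
    every token of scale i may attend to all tokens of scales <= i, so a
    whole scale is predicted in parallel. Switti's non-causal variant
    removes this cross-scale constraint entirely (a full True mask),
    enabling ~21% faster sampling and lower memory usage."""
    n = sum(scale_sizes)
    mask = np.zeros((n, n), dtype=bool)
    start = 0
    for size in scale_sizes:
        end = start + size
        # rows of this scale attend to this scale and all earlier ones
        mask[start:end, :end] = True
        start = end
    return mask
```

For example, with scales of 1 and 4 tokens, the single coarse token cannot attend to the fine scale under the causal mask, while every fine token attends to all five positions.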
Anton Voronov
Yandex Research, HSE University, MIPT
Denis Kuznedelev
Yandex Research, Skoltech
Mikhail Khoroshikh
ITMO University
Valentin Khrulkov
AIRI
Machine Learning, Numerical Mathematics, Mathematical Physics
Dmitry Baranchuk
Yandex Research
Generative Modeling, Computer Vision, Similarity Search