🤖 AI Summary
This work addresses the long-standing trade-off between efficiency and generation quality in text-to-image (T2I) synthesis. We propose Switti, a scale-wise transformer for T2I generation. Methodologically, Switti adapts a next-scale prediction autoregressive (AR) architecture to T2I while resolving training-stability issues; drops the causal attention constraint, since scale-wise prediction does not require causality, yielding ~21% faster sampling and lower memory usage with slightly better generation quality; and disables classifier-free guidance at high-resolution scales, where it is often unnecessary and can even hurt fine-grained details, for a further ~32% sampling acceleration. Human preference studies and automated evaluations show that Switti outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being up to 7× faster.
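The causal-vs-non-causal distinction can be pictured with a toy attention mask. VAR-style scale-wise AR models use a block-causal mask (full attention within a scale, causal across scales); the non-causal counterpart removes that constraint, shown here simply as an unmasked variant. The helper below is an illustrative sketch, not Switti's actual attention implementation:

```python
def scale_mask(scale_sizes, causal=True):
    """Boolean attention mask over the concatenated tokens of all scales.

    causal=True  -> block-causal: tokens of scale i attend to scales 0..i
                    (full attention within a scale, causal across scales).
    causal=False -> no mask: a toy stand-in for the non-causal variant,
                    which needs no causal masking (or KV cache) at sampling.
    scale_sizes and the mask representation are illustrative assumptions.
    """
    n = sum(scale_sizes)
    if not causal:
        return [[True] * n for _ in range(n)]
    # Scale index owning each token position.
    owner = [i for i, s in enumerate(scale_sizes) for _ in range(s)]
    # Query at position q may attend to key j iff j's scale is not later.
    return [[owner[j] <= owner[q] for j in range(n)] for q in range(n)]
```

For example, with scales of 1 and 2 tokens, the first token's causal row is `[True, False, False]`, while the non-causal mask is all `True`.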
📝 Abstract
This work presents Switti, a scale-wise transformer for text-to-image generation. We start by adapting an existing next-scale prediction autoregressive (AR) architecture to T2I generation, investigating and mitigating training stability issues in the process. Next, we argue that scale-wise transformers do not require causality and propose a non-causal counterpart facilitating ~21% faster sampling and lower memory usage while also achieving slightly better generation quality. Furthermore, we reveal that classifier-free guidance at high-resolution scales is often unnecessary and can even degrade performance. By disabling guidance at these scales, we achieve an additional sampling acceleration of ~32% and improve the generation of fine-grained details. Extensive human preference studies and automated evaluations show that Switti outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being up to 7x faster.
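The guidance switch described above can be sketched as a per-scale decision: run standard classifier-free guidance (two forward passes) at early, low-resolution scales and a single conditional pass once a cutoff scale is reached. The cutoff fraction `disable_frac` and the guidance weight below are illustrative assumptions, not values from the paper:

```python
def guided_logit(model, x, cond, uncond, scale_idx, num_scales,
                 guidance=4.0, disable_frac=0.6):
    """Classifier-free guidance, skipped at late high-resolution scales.

    `disable_frac` is a hypothetical cutoff: once scale_idx/num_scales
    reaches it, guidance is disabled and only the conditional pass runs.
    """
    if scale_idx / num_scales >= disable_frac:
        # High-resolution scale: single conditional pass, roughly 2x cheaper.
        return model(x, cond)
    # Low-resolution scale: standard CFG, extrapolating away from uncond.
    lc = model(x, cond)
    lu = model(x, uncond)
    return lu + guidance * (lc - lu)
```

Here `model(x, c)` stands in for one transformer forward pass returning a logit; in practice the extrapolation is applied elementwise to the full logit tensor.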