🤖 AI Summary
This work addresses the critical challenge of adapting diffusion distillation to free-form text prompts in open-domain text-to-image (T2I) generation. We present the first systematic transfer of state-of-the-art diffusion distillation techniques to the powerful T2I teacher model FLUX.1-lite. To this end, we propose a unified distillation framework that identifies text-conditioning-induced optimization instability as the root cause and introduces four synergistic strategies: input scaling, dynamic noise scheduling, cross-modal feature alignment, and lightweight network architecture co-optimization. Our method achieves significant inference acceleration, requiring at most 8 sampling steps, while preserving high visual fidelity, and consistently outperforms existing T2I distillation approaches across multi-scale quantitative and qualitative evaluations. To foster reproducibility and practical deployment, we publicly release our complete codebase and pre-trained lightweight student models, enabling efficient on-device T2I generation.
📝 Abstract
Diffusion distillation has dramatically accelerated class-conditional image synthesis, but its applicability to open-ended text-to-image (T2I) generation remains unclear. We present the first systematic study that adapts and compares state-of-the-art distillation techniques on a strong T2I teacher model, FLUX.1-lite. By casting existing methods into a unified framework, we identify the key obstacles that arise when moving from discrete class labels to free-form language prompts. Beyond a thorough methodological analysis, we offer practical guidelines on input scaling, network architecture, and hyperparameters, accompanied by an open-source implementation and pre-trained student models. Our findings establish a solid foundation for deploying fast, high-fidelity, and resource-efficient diffusion generators in real-world T2I applications. Code is available at github.com/alibaba-damo-academy/T2I-Distill.