🤖 AI Summary
This work addresses the difficulty of deploying large text-to-image models on mobile devices, where enormous parameter counts make on-device applications impractical. The authors propose a distillation-driven progressive compression framework that efficiently compresses the 17B-parameter FLUX.1-Schnell model into a compact 2.4B-parameter variant named NanoFLUX. Key innovations include pruning-based compression of the diffusion transformer, a ResNet-guided token downsampling mechanism, and cross-modal distillation of the text encoder that leverages early visual signals from the denoiser. The resulting model generates high-fidelity 512×512 images in approximately 2.5 seconds on mobile hardware, demonstrating that high-quality text-to-image generation is feasible on-device.
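The summary names pruning as the main compression lever but not the criterion used. The PyTorch sketch below is a minimal, hypothetical illustration of one common depth-pruning recipe: rank transformer blocks by how much they change their input on calibration tokens, then keep only the highest-impact ones. The function names and the scoring rule are our assumptions, not the paper's method.

```python
# Hypothetical depth-pruning sketch for a diffusion transformer.
# The scoring rule (relative change a block applies to its input on
# calibration data) is an illustrative assumption, not the paper's criterion.
import torch
import torch.nn as nn


@torch.no_grad()
def score_blocks(blocks: nn.ModuleList, calib_tokens: torch.Tensor) -> list[float]:
    """Score each block by the relative change it applies to its input."""
    scores, x = [], calib_tokens
    for blk in blocks:
        y = blk(x)
        scores.append((y - x).norm().item() / x.norm().item())
        x = y
    return scores


def prune_blocks(blocks: nn.ModuleList, calib_tokens: torch.Tensor,
                 keep: int) -> nn.ModuleList:
    """Keep the `keep` highest-scoring blocks, preserving their order."""
    scores = score_blocks(blocks, calib_tokens)
    keep_idx = sorted(sorted(range(len(blocks)), key=lambda i: -scores[i])[:keep])
    return nn.ModuleList(blocks[i] for i in keep_idx)
```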
📝 Abstract
While large-scale text-to-image diffusion models continue to improve in visual quality, their increasing scale has widened the gap between state-of-the-art models and on-device solutions. To address this gap, we introduce NanoFLUX, a 2.4B text-to-image flow-matching model distilled from the 17B FLUX.1-Schnell using a progressive compression pipeline designed to preserve generation quality. Our contributions include: (1) a model compression strategy driven by pruning redundant components in the diffusion transformer, reducing its size from 12B to 2B; (2) a ResNet-based token downsampling mechanism that reduces latency by allowing intermediate blocks to operate on lower-resolution tokens while preserving high-resolution processing elsewhere; (3) a novel text encoder distillation approach that leverages visual signals from early layers of the denoiser during sampling. Empirically, NanoFLUX generates 512×512 images in approximately 2.5 seconds on mobile devices, demonstrating the feasibility of high-quality on-device text-to-image generation.
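Below is a minimal PyTorch sketch of how the ResNet-based token downsampling in contribution (2) could look, assuming image tokens form a square latent grid: a strided residual block halves the grid so the intermediate transformer blocks run on 4× fewer tokens, and nearest-neighbor upsampling restores full resolution for the remaining blocks. All module names and the 2× factor are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of ResNet-based token downsampling in a DiT backbone.
# Assumes a square token grid and omits conditioning inputs for brevity.
import torch
import torch.nn as nn


class ResidualDownsample(nn.Module):
    """Strided ResNet-style block: halves the token grid resolution."""

    def __init__(self, dim: int):
        super().__init__()
        self.conv1 = nn.Conv2d(dim, dim, 3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(dim, dim, 3, stride=1, padding=1)
        self.skip = nn.Conv2d(dim, dim, 1, stride=2)  # match spatial size
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.act(self.conv1(x))
        h = self.conv2(h)
        return self.act(h + self.skip(x))


class TokenDownsampledBlocks(nn.Module):
    """Runs `blocks` on a 2x-downsampled token grid, then upsamples back."""

    def __init__(self, dim: int, blocks: nn.ModuleList):
        super().__init__()
        self.down = ResidualDownsample(dim)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.blocks = blocks

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, n, d = tokens.shape
        side = int(n ** 0.5)  # assumes a square token grid
        x = tokens.transpose(1, 2).reshape(b, d, side, side)
        x = self.down(x)                    # (b, d, side/2, side/2)
        low = x.flatten(2).transpose(1, 2)  # (b, n/4, d): 4x fewer tokens
        for blk in self.blocks:
            low = blk(low)                  # intermediate blocks at low res
        x = low.transpose(1, 2).reshape(b, d, side // 2, side // 2)
        x = self.up(x)                      # restore full resolution
        return x.flatten(2).transpose(1, 2)
```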
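Contribution (3) is described only at a high level, so the following is a speculative sketch of one way such a cross-modal objective could be wired up: the student text encoder is trained so that early-layer denoiser activations under its embeddings match those produced with the teacher's embeddings during sampling. `denoiser_early` and all argument names are hypothetical, not the paper's API.

```python
# Speculative sketch of a cross-modal text-encoder distillation loss.
# `denoiser_early` (a callable returning early-layer denoiser features)
# and all argument names are our assumptions.
import torch
import torch.nn.functional as F


def cross_modal_distill_loss(denoiser_early, latents, timesteps,
                             student_text_emb, teacher_text_emb):
    """MSE between early denoiser features under student vs. teacher text."""
    with torch.no_grad():
        target = denoiser_early(latents, timesteps, teacher_text_emb)
    pred = denoiser_early(latents, timesteps, student_text_emb)
    return F.mse_loss(pred, target)
```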