🤖 AI Summary
To address the high memory and computational overhead of FLUX.1-dev, a state-of-the-art text-to-image diffusion model, during 1024×1024 high-resolution generation, this work proposes the first data-free, self-supervised ternary quantization method for the model, constraining weights to {−1, 0, +1} (about 1.58 bits per weight). The authors develop custom inference kernels optimized for ternary weights and integrate them end-to-end with the FLUX.1-dev architecture. Crucially, the approach requires no image data or fine-tuning, sidestepping the data dependency and optimization difficulties typical of ultra-low-bit quantization. Experiments show a 7.7× reduction in model storage, a 5.1× reduction in GPU memory during inference, and improved inference latency, all while preserving generation quality comparable to the full-precision baseline on GenEval and T2I-CompBench. The result suggests that high-fidelity text-to-image models can be deployed far more efficiently without sacrificing visual quality.
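To make the "1.58-bit" idea concrete, here is a minimal sketch of ternary weight quantization with a per-tensor scale. The summary does not specify the exact rounding rule, so the absmean scheme below (as popularized by BitNet b1.58) is an illustrative assumption, not necessarily the authors' method; the function names are likewise hypothetical.

```python
import math

def ternary_quantize(weights, eps=1e-8):
    """Quantize a list of weights to {-1, 0, +1} plus one float scale.

    Absmean scheme (assumed, not confirmed by the paper): divide by the
    mean absolute value, round to the nearest integer, clip to [-1, 1].
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate weights: each code is -scale, 0, or +scale."""
    return [c * scale for c in codes]

# "1.58-bit" because a ternary weight carries log2(3) ≈ 1.585 bits.
weights = [0.9, -0.05, -1.3, 0.4, 0.02, -0.7]
codes, scale = ternary_quantize(weights)
print(codes)  # → [1, 0, -1, 1, 0, -1]
```

Storing only the int codes plus a single scale per tensor is what drives the model-size reduction; the custom kernels then compute matmuls directly on the ternary codes rather than dequantizing to full precision first.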
📝 Abstract
We present 1.58-bit FLUX, the first successful approach to quantizing the state-of-the-art text-to-image generation model, FLUX.1-dev, using 1.58-bit weights (i.e., values in {-1, 0, +1}) while maintaining comparable performance for generating 1024×1024 images. Notably, our quantization method operates without access to image data, relying solely on self-supervision from the FLUX.1-dev model. Additionally, we develop a custom kernel optimized for 1.58-bit operations, achieving a 7.7× reduction in model storage, a 5.1× reduction in inference memory, and improved inference latency. Extensive evaluations on the GenEval and T2I-CompBench benchmarks demonstrate the effectiveness of 1.58-bit FLUX in maintaining generation quality while significantly enhancing computational efficiency.