🤖 AI Summary
To address memory and computational bottlenecks in deploying diffusion models on mobile devices, this paper proposes Fractional-bit Quantization-Aware Training (FB-QAT), which mitigates the severe generative quality degradation that occurs during progressive quantization from 32-bit to 4-bit. The method features: (1) a differentiable fractional-bit quantization scheme that enables fine-grained information preservation between integer bit-widths; and (2) a progressive precision decay strategy coupled with gradient compensation to stabilize low-bit training dynamics. Evaluated on SD3.5-Medium, Sana, PixArt, and FLUX.1-schnell, FB-QAT achieves a 4–7% improvement in Fréchet Inception Distance (FID) over baseline quantization methods, marking the first successful high-fidelity generation at 4-bit precision. The approach has been deployed on the Samsung S25U smartphone (powered by the Snapdragon 8 Elite HTP), enabling efficient on-device inference.
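The two components above can be illustrated with a minimal sketch. This is an assumption about how a fractional bit-width (e.g., 5.3 bits) might be realized — here as a convex combination of the two adjacent integer-bit uniform quantizers — not necessarily the paper's exact scheme; the function names and the decay schedule values are likewise illustrative.

```python
import numpy as np

def uniform_quantize(x, bits, x_min=-1.0, x_max=1.0):
    """Uniform quantizer with 2**bits levels over [x_min, x_max]."""
    levels = 2 ** bits
    scale = (x_max - x_min) / (levels - 1)
    q = np.round((np.clip(x, x_min, x_max) - x_min) / scale)
    return q * scale + x_min

def fractional_bit_quantize(x, bits):
    """Fractional-bit quantization, sketched as interpolation between
    the floor- and ceil-bit quantizers (illustrative assumption)."""
    lo, hi = int(np.floor(bits)), int(np.ceil(bits))
    if lo == hi:
        return uniform_quantize(x, lo)
    frac = bits - lo  # fractional part weights the higher-precision quantizer
    return (1 - frac) * uniform_quantize(x, lo) + frac * uniform_quantize(x, hi)

# Progressive precision decay: sweep the bit-width down toward 4 bits,
# passing through fractional values instead of jumping between integers.
rng = np.random.default_rng(0)
w = np.tanh(rng.standard_normal(8))  # toy parameters in [-1, 1]
for bits in [8.0, 6.5, 5.3, 4.0]:
    w_q = fractional_bit_quantize(w, bits)
```

In this interpolation view, a fractional bit-width changes smoothly with the decay schedule, which is what makes the precision reduction amenable to gradient-based optimization (in the paper's training setup, the straight-through-style gradients would flow through the rounding).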
📝 Abstract
State-of-the-art (SOTA) generative models have demonstrated impressive capabilities in image synthesis and text generation, typically at the cost of large model capacity. However, such large models cannot be deployed on smartphones due to limited on-board memory and compute. Quantization methods lower the precision of the model parameters, allowing for efficient computation, e.g., in INT8. Although aggressive quantization addresses efficiency and memory constraints, preserving model quality remains a challenge. To retain quality under aggressive quantization, we propose a new fractional-bit quantization approach (FB-QAT). The novelty is a simple yet effective idea: we progressively reduce the model's precision from 32 to 4 bits per parameter, and exploit the fractional bits during optimization to maintain high generation quality. We show that FB-QAT yields improved quality on a variety of diffusion models, including SD3.5-Medium, Sana, PixArt, and FLUX.1-schnell, while achieving 4–7% lower FID than standard QAT. Finally, we deploy and run Sana on a Samsung S25U, which runs on the Qualcomm SM8750-AB Snapdragon 8 Elite Hexagon Tensor Processor (HTP).