🤖 AI Summary
To address memory and computational bottlenecks in deploying diffusion models on mobile devices, this paper proposes Fractional-bit Quantization-Aware Training (FB-QAT), which mitigates the severe generative quality degradation that occurs during progressive quantization from 32-bit to 4-bit. The method features: (1) a differentiable fractional-bit quantization scheme that enables fine-grained information preservation between integer bit-widths; and (2) a progressive precision decay strategy coupled with gradient compensation to stabilize low-bit training dynamics. Evaluated on SD3.5-Medium, Sana, PixArt, and FLUX.1-schnell, FB-QAT achieves a 4–7% improvement in Fréchet Inception Distance (FID) over baseline quantization methods, marking the first successful high-fidelity generation at 4-bit precision. The approach has been deployed on the Samsung S25U smartphone (powered by the Snapdragon 8 Elite HTP), enabling efficient on-device inference.
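The two components above can be illustrated with a minimal sketch. This is an assumption about how a fractional bit-width (e.g., 5.3 bits) might be realized — here as a convex combination of the two adjacent integer-bit uniform quantizers — not necessarily the paper's exact scheme; the function names and the decay schedule values are likewise illustrative.

```python
import numpy as np

def uniform_quantize(x, bits, x_min=-1.0, x_max=1.0):
    """Uniform quantizer with 2**bits levels over [x_min, x_max]."""
    levels = 2 ** bits
    scale = (x_max - x_min) / (levels - 1)
    q = np.round((np.clip(x, x_min, x_max) - x_min) / scale)
    return q * scale + x_min

def fractional_bit_quantize(x, bits):
    """Fractional-bit quantization, sketched as interpolation between
    the floor- and ceil-bit quantizers (illustrative assumption)."""
    lo, hi = int(np.floor(bits)), int(np.ceil(bits))
    if lo == hi:
        return uniform_quantize(x, lo)
    frac = bits - lo  # fractional part weights the higher-precision quantizer
    return (1 - frac) * uniform_quantize(x, lo) + frac * uniform_quantize(x, hi)

# Progressive precision decay: sweep the bit-width down toward 4 bits,
# passing through fractional values instead of jumping between integers.
rng = np.random.default_rng(0)
w = np.tanh(rng.standard_normal(8))  # toy parameters in [-1, 1]
for bits in [8.0, 6.5, 5.3, 4.0]:
    w_q = fractional_bit_quantize(w, bits)
```

In this interpolation view, a fractional bit-width changes smoothly with the decay schedule, which is what makes the precision reduction amenable to gradient-based optimization (in the paper's training setup, the straight-through-style gradients would flow through the rounding).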
📝 Abstract
State-of-the-art (SOTA) generative models have demonstrated impressive capabilities in image synthesis and text generation, typically at the cost of large model capacity. However, such large models cannot be deployed on smartphones due to limited on-board memory and compute. Quantization methods lower the precision of the model parameters, allowing for efficient computation, e.g., in INT8. Although aggressive quantization addresses efficiency and memory constraints, preserving model quality remains a challenge. To retain quality under aggressive quantization, we propose a new fractional-bit quantization approach (FB-QAT). The novelty is a simple yet effective idea: we progressively reduce the model's precision from 32 to 4 bits per parameter, and exploit the fractional bits during optimization to maintain high generation quality. We show that FB-QAT yields improved quality on a variety of diffusion models, including SD3.5-Medium, Sana, PixArt, and FLUX.1-schnell, while achieving 4–7% lower FID than standard QAT. Finally, we deploy and run Sana on a Samsung S25U, which runs on the Qualcomm SM8750-AB Snapdragon 8 Elite Hexagon Tensor Processor (HTP).