Joint Post-Training Quantization of Vision Transformers with Learned Prompt-Guided Data Generation

📅 2026-02-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the severe accuracy degradation of Vision Transformers under ultra-low-bit post-training quantization (PTQ), such as W1.58A8, caused by the absence of labeled calibration data. The authors propose an end-to-end joint quantization framework that operates without real labels, using learnable multimodal prompts to guide Stable Diffusion Turbo in generating diverse, representative synthetic samples for calibration. The approach enables global optimization of various architectures, including ViT, DeiT, and Swin-T. Notably, it achieves state-of-the-art PTQ performance at extremely low bit-widths, matching or exceeding the accuracy of methods that use real data: it sets new records on ImageNet for W4A4 and W3A3 quantization and maintains strong accuracy even at W1.58A8. Furthermore, the entire quantization process for ViT-small completes within one hour on a single GPU.
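To make the bit-width notation concrete: "W4A4" means 4-bit weights and 4-bit activations, while "W1.58" denotes ternary weights (three levels, log2(3) ≈ 1.58 bits each). A minimal symmetric fake-quantization sketch in plain Python, purely illustrative and not the paper's actual quantizer:

```python
def fake_quantize(x, bits):
    """Symmetric uniform fake quantization: round to a signed `bits`-bit grid,
    then dequantize back to floats (illustrative sketch, not the paper's scheme)."""
    qmax = 2 ** (bits - 1) - 1                  # e.g. 7 for 4-bit signed
    scale = max(abs(v) for v in x) / qmax       # per-tensor scale from the value range
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in x]

def ternary_quantize(x):
    """W1.58 restricts weights to three levels {-1, 0, +1} times a scale;
    log2(3) ~= 1.58 bits per weight gives the name. Mean-absolute scale is
    a common choice, assumed here for illustration."""
    scale = sum(abs(v) for v in x) / len(x)
    return [max(-1, min(1, round(v / scale))) * scale for v in x]

w = [0.31, -0.07, 0.88, -0.52]
w4 = fake_quantize(w, bits=4)    # the "W4" in W4A4 (activations handled analogously)
w158 = ternary_quantize(w)       # the "W1.58" ternary regime
```

Calibration data enters through the scale estimates: with too few or unrepresentative samples, the chosen ranges misfit the real activation statistics, which is why the quality of the (here synthetic) calibration set matters so much at these bit-widths.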

πŸ“ Abstract
We present a framework for end-to-end joint quantization of Vision Transformers trained on ImageNet for image classification. Unlike prior post-training or block-wise reconstruction methods, we jointly optimize over all layers and inter-block dependencies without any labeled data, scaling effectively with the number of samples and completing in just one hour on a single GPU for ViT-small. We achieve state-of-the-art W4A4 and W3A3 accuracies on ImageNet and, to the best of our knowledge, the first PTQ results that maintain strong accuracy on ViT, DeiT, and Swin-T models under extremely low-bit settings (W1.58A8), demonstrating the potential for efficient edge deployment. Furthermore, we introduce a data-free calibration strategy that synthesizes diverse, label-free samples using Stable Diffusion Turbo guided by learned multi-mode prompts. By encouraging diversity in both the learned prompt embeddings and the generated image features, our data-free approach achieves performance on par with real-data ImageNet calibration and surpasses simple text-prompt baselines such as "a <adjective> photo of <adjective> <cls>".
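For context, the text-prompt baseline the abstract compares against amounts to filling a fixed template with random words. A minimal sketch, where the adjective list and function name are our own hypothetical choices, not taken from the paper:

```python
import random

# Illustrative adjective pool; the paper does not specify one.
ADJECTIVES = ["bright", "blurry", "close-up", "colorful", "dim", "sharp"]

def baseline_prompt(cls_name, rng=random):
    """Fill the fixed template 'a <adjective> photo of <adjective> <cls>'
    with randomly drawn adjectives and a class name."""
    adj1 = rng.choice(ADJECTIVES)
    adj2 = rng.choice(ADJECTIVES)
    return f"a {adj1} photo of {adj2} {cls_name}"

prompt = baseline_prompt("golden retriever", rng=random.Random(0))
print(prompt)
```

The proposed method instead learns prompt embeddings end-to-end, so prompt diversity is optimized rather than sampled from a hand-picked word list.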
Problem

Research questions and friction points this paper is trying to address.

Post-Training Quantization
Vision Transformers
Data-Free Calibration
Low-Bit Quantization
Image Classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Post-Training Quantization
Vision Transformers
Data-Free Calibration
Learned Prompts
Low-Bit Quantization