IPTQ-ViT: Post-Training Quantization of Non-linear Functions for Integer-only Vision Transformers

📅 2025-11-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing post-training quantization (PTQ) methods struggle to enable fully integer-only inference for nonlinear layers—particularly GELU and Softmax—in vision Transformers, often resorting to activation distribution tuning or partial quantization, thereby compromising either accuracy or efficiency. To address this, we propose IPTQ-ViT, the first retraining-free, fully integer PTQ framework for ViTs. First, we replace GELU with a low-degree polynomial approximation and Softmax with bit-shift-based operations, eliminating floating-point nonlinearities entirely. Second, we introduce a unified layer-wise metric that jointly considers quantization sensitivity, output perturbation, and computational overhead to adaptively select the optimal approximation per layer. Evaluated on image classification, IPTQ-ViT achieves an average +1.78% top-1 accuracy gain (up to +6.44%), and improves object detection by +1.0 mAP—matching quantization-aware training performance—while enabling efficient W8A8/W4A8 deployment.
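To make the first idea concrete, here is a minimal floating-point sketch of a low-degree polynomial GELU approximation in the spirit of integer-only methods such as I-BERT's i-GELU. The clipped second-order polynomial approximates erf(x/√2); the coefficients `a` and `b` are the published I-BERT fits and are illustrative only, not the coefficients IPTQ-ViT selects for vision data.

```python
import numpy as np

def poly_gelu(x):
    """GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2))), with erf replaced by a
    clipped second-order polynomial (coefficients from I-BERT, illustrative)."""
    a, b = -0.2888, -1.769
    t = x / np.sqrt(2.0)
    # sign(t) * [a * (min(|t|, -b) + b)^2 + 1] saturates to +/-1 for large |t|
    erf_approx = np.sign(t) * (a * (np.minimum(np.abs(t), -b) + b) ** 2 + 1.0)
    return 0.5 * x * (1.0 + erf_approx)
```

Because the approximation is a fixed polynomial, it can later be folded into integer arithmetic with scale factors; e.g. `poly_gelu(np.array([1.0]))` lands within about 0.005 of the exact GELU value 0.8413.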

📝 Abstract
Previous Quantization-Aware Training (QAT) methods for vision transformers rely on expensive retraining to recover the accuracy lost when quantizing non-linear layers, limiting their use in resource-constrained environments. In contrast, existing Post-Training Quantization (PTQ) methods either partially quantize non-linear functions or adjust activation distributions to maintain accuracy, but fail to achieve fully integer-only inference. In this paper, we introduce IPTQ-ViT, a novel PTQ framework for fully integer-only vision transformers without retraining. We present two approximation functions: a polynomial-based GELU optimized for vision data and a bit-shifting-based Softmax designed to improve approximation accuracy in PTQ. In addition, we propose a unified metric integrating quantization sensitivity, perturbation, and computational cost to select the optimal approximation function per activation layer. IPTQ-ViT outperforms previous PTQ methods, achieving up to 6.44%p (avg. 1.78%p) top-1 accuracy improvement for image classification and 1.0 mAP for object detection. IPTQ-ViT outperforms partial floating-point PTQ methods under W8A8 and W4A8, and achieves accuracy and latency comparable to integer-only QAT methods. We plan to release our code at https://github.com/gihwan-kim/IPTQ-ViT.git.
Problem

Research questions and friction points this paper is trying to address.

Achieving fully integer-only vision transformers without retraining
Optimizing quantization of non-linear functions like GELU and Softmax
Improving accuracy and latency in resource-constrained environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Polynomial-based GELU approximation for vision data
Bit-shifting-based Softmax for integer-only inference
Unified metric for optimal approximation function selection
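The bit-shifting Softmax idea above can be sketched in floating point before any integer lowering. The sketch below follows the general shift-based recipe (as in I-ViT's Shiftmax, used here as an assumed reference point, not IPTQ-ViT's exact formulation): approximate log2(e) with the shift-friendly sum 1 + 1/2 − 1/16, split the base-2 exponent into an integer part (a bit-shift in integer arithmetic) and a fractional part handled by a linear term.

```python
import numpy as np

def shift_softmax(x):
    """Softmax with exp replaced by a shift-style base-2 decomposition
    (illustrative float sketch; an integer kernel would use actual shifts)."""
    z = x - np.max(x, axis=-1, keepdims=True)   # z <= 0, numerically stable
    # x * log2(e) approximated with shifts: log2(e) ~ 1 + 1/2 - 1/16 = 1.4375
    t = z + z / 2.0 - z / 16.0
    q = np.floor(t)                             # integer exponent -> right-shift
    r = t - q                                   # fractional part in [0, 1)
    exp_approx = (1.0 + r) * (2.0 ** q)         # linear interpolation of 2^r
    return exp_approx / np.sum(exp_approx, axis=-1, keepdims=True)
```

Normalization cancels much of the per-element approximation error, so the resulting probabilities stay close to the exact Softmax, e.g. `shift_softmax(np.array([1.0, 2.0, 3.0]))` matches the true distribution to within a few hundredths per entry.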