Mixed Non-linear Quantization for Vision Transformers

📅 2024-07-26
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Vision Transformer (ViT) quantization methods focus primarily on model compression while overlooking quantization error in non-linear operations (e.g., GELU, Softmax); the few methods that do quantize non-linear operations apply a single scheme uniformly, with no layer-wise adaptability. To address this, the authors propose Mixed Non-linear Quantization, which measures each non-linear layer's quantization sensitivity with a layer-wise Signal-to-Quantization-Noise Ratio (SQNR) difference metric and assigns to each layer the known quantization method that minimizes its error. The method outperforms I-BERT, FQ-ViT, and I-ViT across ViT, DeiT, and Swin models by an average of 0.6%p in the 8-bit setting and 19.6%p in the 6-bit setting, and outperforms I-BERT and I-ViT by 0.6%p and 20.8%p, respectively, when training time is limited.

📝 Abstract
The majority of quantization methods have been proposed to reduce the model size of Vision Transformers, yet most of them have overlooked the quantization of non-linear operations. Only a few works have addressed quantization for non-linear operations, but they applied a single quantization method across all non-linear operations. We believe that this can be further improved by employing a different quantization method for each non-linear operation. Therefore, to assign the most error-minimizing quantization method from the known methods to each non-linear layer, we propose a mixed non-linear quantization that considers layer-wise quantization sensitivity measured by SQNR difference metric. The results show that our method outperforms I-BERT, FQ-ViT, and I-ViT in both 8-bit and 6-bit settings for ViT, DeiT, and Swin models by an average of 0.6%p and 19.6%p, respectively. Our method outperforms I-BERT and I-ViT by 0.6%p and 20.8%p, respectively, when training time is limited. We plan to release our code at https://gitlab.com/ones-ai/mixed-non-linear-quantization.
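The selection mechanism described in the abstract can be sketched as follows: for each non-linear layer, compute the SQNR between the full-precision output and the output under each candidate quantization scheme, then assign the scheme with the highest SQNR (lowest quantization error). This is a minimal illustrative sketch, not the paper's implementation; the candidate scheme names (`int8_lut`, `int6_lut`, `int8_out`) and the uniform fake-quantizer are assumptions for demonstration, and the tanh-based GELU stands in for whatever integer-friendly approximations the paper actually compares.

```python
import numpy as np

def sqnr_db(reference, approximation):
    """SQNR in dB between a float reference and its quantized approximation."""
    noise = reference - approximation
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def fake_quantize(x, num_bits=8):
    """Uniform symmetric fake-quantization (quantize, then dequantize)."""
    scale = np.max(np.abs(x)) / (2 ** (num_bits - 1) - 1)
    return np.round(x / scale) * scale

def gelu(x):
    """Tanh approximation of GELU, standing in for one non-linear layer."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

# Calibration activations for one hypothetical GELU layer.
rng = np.random.default_rng(0)
acts = rng.normal(size=(64, 768))
reference = gelu(acts)

# Candidate (hypothetical) non-linear quantization schemes for this layer.
candidates = {
    "int8_lut": gelu(fake_quantize(acts, 8)),   # quantize input, exact GELU
    "int6_lut": gelu(fake_quantize(acts, 6)),
    "int8_out": fake_quantize(gelu(acts), 8),   # exact GELU, quantize output
}

# Layer-wise assignment: keep the scheme with the highest SQNR.
scores = {name: sqnr_db(reference, approx) for name, approx in candidates.items()}
best = max(scores, key=scores.get)
```

Repeating this per non-linear layer yields a mixed, per-layer assignment rather than one scheme applied uniformly, which is the core idea the abstract attributes to the method.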
Problem

Research questions and friction points this paper is trying to address.

Quantizing non-linear operations in Vision Transformers
Optimizing layer-wise quantization sensitivity for error minimization
Improving accuracy over existing methods in limited-bit settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixed non-linear quantization for Vision Transformers
Layer-wise quantization sensitivity analysis
Error-minimizing quantization per non-linear operation