🤖 AI Summary
This study investigates the impact of low-bit quantization on out-of-distribution (OOD) detection performance in vision transformers, revealing significant degradation, particularly under 4-bit settings. The authors systematically evaluate small variants of DeiT, DeiT3, and ViT, pretrained on either ImageNet-1k or ImageNet-22k, using metrics such as AUPR-out to assess OOD robustness after quantization. Notably, the work demonstrates for the first time, from an OOD perspective, that large-scale pretraining on ImageNet-22k exacerbates the decline in OOD robustness after quantization, with AUPR-out dropping by 15.0%–19.2%, markedly more than the 9.5%–12.0% reduction observed in ImageNet-1k-pretrained counterparts. The study also finds that data augmentation strategies are more effective than merely scaling up pretraining data for improving OOD robustness in quantized models.
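For context, the "4-bit settings" above refer to representing model weights on a small integer grid. The following is a minimal, self-contained sketch of symmetric uniform fake-quantization, shown only to illustrate where the precision loss comes from; it is not the quantization scheme used in the paper:

```python
def fake_quantize(weights, n_bits=4):
    """Symmetric uniform fake-quantization (illustrative sketch only).

    Maps each float to the nearest point on a signed n_bits integer
    grid and back to float, mimicking the rounding error that low-bit
    quantization introduces.
    """
    qmax = 2 ** (n_bits - 1) - 1                  # e.g. 7 for signed 4-bit
    scale = max(abs(w) for w in weights) / qmax   # per-tensor scale factor
    return [round(w / scale) * scale for w in weights]
```

At 4 bits only 15 signed levels are available, so rounding error per weight can reach half the quantization step, which is the kind of perturbation the study measures downstream in OOD detection.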
📝 Abstract
Vision transformers have shown remarkable performance in vision tasks, but enabling them for accessible and real-time use remains challenging. Quantization reduces memory and inference costs at the risk of performance loss. Strides have been made to mitigate low-precision issues, mainly by studying in-distribution (ID) task behaviour, but the attention mechanism may offer further insight into quantization behaviour when examined in out-of-distribution (OOD) settings. We investigate the behaviour of quantized small variants of popular vision transformers (DeiT, DeiT3, and ViT) on common OOD datasets. ID analyses reveal the initial instabilities of 4-bit models, particularly those trained on the larger ImageNet-22k: the strongest FP32 model, DeiT3, sharply drops 17% from quantization error to become one of the weakest 4-bit models. While ViT shows reasonable quantization robustness under ID calibration, OOD detection reveals more: ViT and DeiT3 pretrained on ImageNet-22k experienced average quantization deltas in AUPR-out of 15.0% and 19.2%, respectively, between full precision and 4-bit, while their ImageNet-1k-only counterparts experienced deltas of 9.5% and 12.0%. Overall, our results suggest that pretraining on large-scale datasets may hinder low-bit quantization robustness in OOD detection, and that data augmentation may be a more beneficial alternative.
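For readers unfamiliar with the metric, AUPR-out is the area under the precision-recall curve with OOD samples treated as the positive ("out") class. A minimal pure-Python sketch of the computation (illustrative only; the function name and toy scores are not from the paper):

```python
def aupr_out(scores, is_ood):
    """Area under the precision-recall curve with OOD as positive class.

    scores: higher value = model considers the sample more OOD-like
            (e.g. negated max softmax probability).
    is_ood: 1 for an OOD sample, 0 for an in-distribution sample.
    Computed as average precision: sum of precision weighted by
    recall increments while sweeping the threshold from high to low.
    """
    pairs = sorted(zip(scores, is_ood), key=lambda p: -p[0])
    total_pos = sum(is_ood)
    tp = fp = 0
    ap = prev_recall = 0.0
    for _, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap
```

A perfectly separating detector scores 1.0; the "quantization delta" reported above is simply the drop in this value between the FP32 model and its 4-bit counterpart on the same OOD benchmark.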