🤖 AI Summary
This work addresses the accuracy degradation that activation outliers cause in low-bit quantized vision Transformers, a problem existing methods handle poorly because they struggle to balance outlier suppression against accuracy preservation. The authors propose Colinearity-Decay (CD), a non-invasive, decoupled structural regularizer that penalizes harmful collinearity between ordered matrix pairs within Transformer blocks during training, which in turn mitigates extreme activations. CD requires no changes to the model architecture or the task loss and adds no inference-time overhead. Experiments on ImageNet-1K pre-training, COCO object detection, and downstream fine-tuning show that CD consistently improves quantized accuracy across multiple quantization pipelines while preserving, or even improving, full-precision accuracy.
📝 Abstract
Low-bit quantization is a practical route for efficiently deploying vision Transformers, yet activation outliers complicate fully quantized deployment. Existing methods either handle quantization post-training or suppress large activations during training; however, aggressively restricting outliers in vision models can lead to a poorer trade-off between full-precision and quantized accuracy. We argue that rather than simply suppressing outliers, the training objective should control the structural amplification that makes them harmful. To this end, we introduce Colinearity-Decay (CD), a structural regularizer for ordered matrix pairs within Transformer blocks. CD penalizes detrimental cross-matrix alignment and mitigates extreme activations without altering the architecture or task loss. Applied as a decoupled update, CD is non-invasive and introduces minimal training overhead. Across ImageNet-1K pre-training, COCO detection, and downstream fine-tuning, CD consistently boosts quantized accuracy across multiple pipelines while preserving, or even improving, full-precision performance. Ultimately, our results demonstrate that structural regularization effectively prepares vision Transformers for low-bit deployment with zero inference-time overhead.
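The abstract does not state CD's penalty, so the following is only a minimal sketch of what a decoupled structural update on an ordered matrix pair could look like, not the authors' formulation. It assumes a pair in which `w_first` feeds `w_second`, and it uses a hypothetical stand-in penalty, 0.5 * ||w_second @ w_first||_F^2, which shrinks as the output directions of the first matrix align less with the input directions of the second. All function names, shapes, and hyperparameters are illustrative assumptions.

```python
import numpy as np


def alignment_penalty(w_first, w_second):
    """Hypothetical stand-in penalty: 0.5 * ||w_second @ w_first||_F^2.

    NOT the penalty from the paper; chosen only because it is small when
    w_second's input directions are weakly aligned with w_first's outputs.
    """
    product = w_second @ w_first
    return 0.5 * float(np.sum(product ** 2))


def alignment_penalty_grads(w_first, w_second):
    """Analytic gradients of the stand-in penalty w.r.t. both matrices."""
    product = w_second @ w_first
    grad_first = w_second.T @ product   # d/d w_first
    grad_second = product @ w_first.T   # d/d w_second
    return grad_first, grad_second


def decoupled_step(w_first, w_second, lr, strength):
    """Apply the regularizer as a separate update, outside the task loss.

    This mirrors how decoupled weight decay is applied next to the
    optimizer step: the task-loss gradient is never modified.
    """
    grad_first, grad_second = alignment_penalty_grads(w_first, w_second)
    new_first = w_first - lr * strength * grad_first
    new_second = w_second - lr * strength * grad_second
    return new_first, new_second


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_first = rng.standard_normal((4, 3))    # assumed shape: hidden x input
    w_second = rng.standard_normal((5, 4))   # assumed shape: output x hidden
    before = alignment_penalty(w_first, w_second)
    w_first, w_second = decoupled_step(w_first, w_second, lr=1e-2, strength=0.1)
    after = alignment_penalty(w_first, w_second)
    print(f"penalty before: {before:.4f}, after: {after:.4f}")
```

Because the update touches only the weights, a step like this would run alongside the ordinary optimizer step during training and leave the deployed model unchanged, which is consistent with the zero inference-time overhead the abstract claims.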