🤖 AI Summary
Vector quantization (VQ) achieves lower distortion than uniform quantization under extreme compression, yet fine-tuning suffers from accuracy degradation: weights assigned to a shared codeword are forced to update in identical directions, which often conflicts with their local gradient information. This work proposes Sign-Splitting Vector Quantization (SSVQ), a VQ paradigm that decouples weight signs from magnitudes. Signs are modeled as learnable latent variables, while magnitudes are clustered via k-means over the all-positive weights to construct a magnitude codebook; both components are jointly optimized. A progressive freezing strategy is introduced to stabilize sign learning. By breaking VQ's inherent directional-coupling constraint, SSVQ significantly improves the compression–accuracy trade-off across diverse models and tasks. Hardware measurements demonstrate reduced memory access and a 3× inference speedup over an 8-bit baseline.
📝 Abstract
Vector Quantization (VQ) has emerged as a prominent weight compression technique, showcasing substantially lower quantization errors than uniform quantization across diverse models, particularly in extreme compression scenarios. However, its efficacy during fine-tuning is limited by the constraint of the compression format, where weight vectors assigned to the same codeword are restricted to updates in the same direction. Consequently, many quantized weights are compelled to move in directions contrary to their local gradient information. To mitigate this issue, we introduce a novel VQ paradigm, Sign-Splitting VQ (SSVQ), which decouples the sign bit of weights from the codebook. Our approach involves extracting the sign bits of uncompressed weights and performing clustering and compression on all-positive weights. We then introduce latent variables for the sign bit and jointly optimize both the signs and the codebook. Additionally, we implement a progressive freezing strategy for the learnable sign to ensure training stability. Extensive experiments on various modern models and tasks demonstrate that SSVQ achieves a significantly superior compression-accuracy trade-off compared to conventional VQ. Furthermore, we validate our algorithm on a hardware accelerator, showing that SSVQ achieves a 3$\times$ speedup over the 8-bit compressed model by reducing memory access.
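The core compression step described above — strip the sign bits, cluster only the all-positive magnitudes, then reconstruct as sign times codeword — can be sketched in a few lines of NumPy. This is a simplified illustration under assumptions not stated in the abstract: it clusters scalar magnitudes with a plain Lloyd-style k-means (the paper's method operates on weight vectors and additionally learns the signs and codebook during fine-tuning, which is omitted here); all function names are hypothetical.

```python
import numpy as np

def ssvq_compress(W, k=8, iters=25, seed=0):
    """Toy sign-splitting quantizer: signs stored separately,
    k-means run on the all-positive magnitudes."""
    rng = np.random.default_rng(seed)
    signs = np.where(W >= 0, 1.0, -1.0)   # extracted sign bits (1 bit/weight)
    mags = np.abs(W).ravel()              # all-positive values to cluster
    centers = rng.choice(mags, size=k, replace=False)
    for _ in range(iters):
        # assign each magnitude to its nearest codeword, then update centers
        assign = np.argmin(np.abs(mags[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            members = mags[assign == j]
            if members.size:
                centers[j] = members.mean()
    assign = np.argmin(np.abs(mags[:, None] - centers[None, :]), axis=1)
    return signs, assign.reshape(W.shape), centers

def ssvq_decompress(signs, assign, centers):
    # reconstruction = sign bit * positive magnitude codeword
    return signs * centers[assign]

W = np.random.default_rng(1).normal(size=(8, 8)).astype(np.float32)
signs, assign, centers = ssvq_compress(W, k=8)
W_hat = ssvq_decompress(signs, assign, centers)
mse = float(np.mean((W - W_hat) ** 2))
```

Because the codebook holds only magnitudes, updating a codeword moves all member weights by the same amount but not in the same signed direction, which is the degree of freedom SSVQ exploits during fine-tuning.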