🤖 AI Summary
This work addresses the lack of systematic investigation into format selection and performance trade-offs in existing low-bit quantization-aware training (QAT) methods, as well as their insufficient evaluation on generative tasks. To this end, we propose the first integration of k-means clustering into QAT for 1-bit weight quantization, optimizing generative performance under a fixed inference memory budget. Our approach moves beyond conventional integer-based quantization schemes by leveraging learned cluster centroids to better preserve model fidelity at ultra-low bitwidths. Experimental results demonstrate that, under identical memory constraints, our method significantly outperforms state-of-the-art integer quantization approaches while remaining compatible with general-purpose hardware for efficient deployment.
📝 Abstract
Quantization-aware training (QAT) is an effective method to drastically reduce the memory footprint of LLMs while keeping performance degradation at an acceptable level. However, the optimal choice of quantization format and bit-width presents a challenge in practice. The design space of quantization formats has not been fully explored in the context of QAT, and the precise trade-off between quantization and downstream performance is poorly understood, as comparisons often rely solely on perplexity-based evaluations. In this work, we address these shortcomings with an empirical study of QAT in the low-bit regime. We show that k-means based weight quantization outperforms integer formats and can be implemented efficiently on standard hardware. Furthermore, we find that, under a fixed inference memory budget, the best performance on generative downstream tasks is achieved with $1$-bit quantized weights.
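To make the idea concrete, here is a minimal sketch of k-means weight quantization in plain NumPy: Lloyd's algorithm clusters a weight tensor into $2^{\text{bits}}$ learned centroids (two centroids at 1 bit), and each weight is replaced by its nearest centroid. This is only an illustration of the quantizer itself, not the paper's full QAT procedure; the function name, initialization, and iteration count are our own assumptions.

```python
import numpy as np

def kmeans_quantize(w, bits=1, iters=25):
    """Quantize a weight tensor to 2**bits k-means centroids (Lloyd's algorithm).

    Hypothetical illustration; the paper integrates this quantizer into QAT,
    which is not reproduced here.
    """
    k = 2 ** bits
    flat = w.ravel()
    # Initialize centroids from quantiles of the weight distribution.
    centroids = np.quantile(flat, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        # Move each centroid to the mean of its assigned weights.
        for j in range(k):
            members = flat[idx == j]
            if members.size > 0:
                centroids[j] = members.mean()
    # Final assignment: reconstruct the tensor from centroid values.
    idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
    return centroids[idx].reshape(w.shape), centroids
```

At 1 bit, the learned centroid pair adapts to the actual weight distribution, which is why it can beat a fixed integer grid: e.g. for roughly Gaussian weights, the two centroids settle near $\pm\,\mathbb{E}[|w|]$ rather than at $\pm 1$. Inference only needs a table lookup per weight, so the scheme stays friendly to general-purpose hardware.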