🤖 AI Summary
Efficient low-bit quantization of generative AI models (LLMs/VLMs) under hardware constraints remains challenging, especially at ultra-low bit-widths. Method: This paper proposes a hardware-aware 3-bit dynamic quantization framework, integrating mixed-precision quantization, hardware-driven dynamic calibration, hierarchical range estimation, lightweight dequantization-operator fusion, and joint LLM/VLM adaptation. Contribution/Results: It achieves stable inference for generative models at 3-bit precision. Evaluated on LLaVA-v1.6, the quantized model retains over 94% of the original task performance while reducing end-to-end memory consumption by 67% and inference latency by 41% compared to state-of-the-art methods. The framework enables flexible, co-optimized trade-offs among accuracy, speed, and memory efficiency, demonstrating practical viability for resource-constrained deployment of multimodal generative AI.
📝 Abstract
We present QuantX: a tailored suite of recipes for LLM and VLM quantization. It is capable of quantizing models down to 3-bit resolutions with minimal loss in performance. The quantization strategies in QuantX take hardware-specific constraints into account to achieve efficient dequantization during inference, ensuring a flexible trade-off between runtime speed, memory requirements, and model accuracy. Our results demonstrate that QuantX achieves performance within 6% of the unquantized model for LLaVA-v1.6 quantized down to 3 bits across multiple end-user tasks, outperforming recently published state-of-the-art quantization techniques. This manuscript provides insights into the LLM quantization process that motivated the range of recipes and options incorporated into QuantX.
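To make the 3-bit setting concrete, the sketch below shows generic group-wise asymmetric 3-bit quantization and dequantization in NumPy. This is an illustrative baseline only, not QuantX's actual recipes: the group size, the asymmetric min/scale parameterization, and all function names are assumptions for illustration. It shows why 3-bit (8 levels per group) is aggressive and why dequantization cost matters at inference time.

```python
import numpy as np

def quantize_3bit(weights: np.ndarray, group_size: int = 64):
    """Group-wise asymmetric 3-bit quantization (codes 0..7).

    Illustrative only; QuantX's hardware-aware recipes are more involved.
    Each group of `group_size` weights shares one (scale, min) pair.
    """
    w = weights.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    # 3 bits give 2**3 = 8 levels, so the group range spans 7 steps.
    scale = (w_max - w_min) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # guard constant groups
    q = np.clip(np.round((w - w_min) / scale), 0, 7).astype(np.uint8)
    return q, scale, w_min

def dequantize_3bit(q, scale, w_min, shape):
    """Reconstruct approximate weights from 3-bit codes."""
    return (q.astype(np.float32) * scale + w_min).reshape(shape)

# Round-trip check: rounding error is at most half a quantization step.
rng = np.random.default_rng(0)
w = rng.normal(size=(128, 128)).astype(np.float32)
q, s, m = quantize_3bit(w)
w_hat = dequantize_3bit(q, s, m, w.shape)
max_err = np.abs(w - w_hat).max()
```

Even this naive scheme bounds the per-weight error by half a step (`scale / 2`); hardware-aware recipes like those described in the abstract additionally shape how codes are packed and unpacked so that dequantization fuses cheaply into inference kernels.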