🤖 AI Summary
Efficient low-bit quantization of generative AI models (LLMs/VLMs) under hardware constraints remains challenging, especially at ultra-low bit-widths. Method: This paper proposes a hardware-aware 3-bit dynamic quantization framework, integrating mixed-precision quantization, hardware-driven dynamic calibration, hierarchical range estimation, lightweight dequantization-operator fusion, and joint LLM/VLM adaptation. Contribution/Results: It achieves stable inference for generative models at 3-bit precision. Evaluated on LLaVA-v1.6, the quantized model retains over 94% of the original task performance while reducing end-to-end memory consumption by 67% and inference latency by 41% compared to state-of-the-art methods. The framework enables flexible, co-optimized trade-offs among accuracy, speed, and memory efficiency, demonstrating practical viability for resource-constrained deployment of multimodal generative AI.
📝 Abstract
We present QuantX: a tailored suite of recipes for LLM and VLM quantization. It is capable of quantizing models down to 3-bit resolutions with minimal loss in performance. The quantization strategies in QuantX take hardware-specific constraints into account to achieve efficient dequantization during inference, ensuring a flexible trade-off between runtime speed, memory requirements, and model accuracy. Our results demonstrate that QuantX achieves performance within 6% of the unquantized model for LLaVA-v1.6 quantized down to 3 bits across multiple end-user tasks, outperforming recently published state-of-the-art quantization techniques. This manuscript provides insights into the LLM quantization process that motivated the range of recipes and options incorporated into QuantX.
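To make the 3-bit setting concrete, the sketch below shows generic group-wise asymmetric 3-bit quantization and dequantization in NumPy. This is an illustrative baseline only, not QuantX's actual recipes: the group size, the asymmetric min/scale parameterization, and all function names are assumptions for illustration. It shows why 3-bit (8 levels per group) is aggressive and why dequantization cost matters at inference time.

```python
import numpy as np

def quantize_3bit(weights: np.ndarray, group_size: int = 64):
    """Group-wise asymmetric 3-bit quantization (codes 0..7).

    Illustrative only; QuantX's hardware-aware recipes are more involved.
    Each group of `group_size` weights shares one (scale, min) pair.
    """
    w = weights.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    # 3 bits give 2**3 = 8 levels, so the group range spans 7 steps.
    scale = (w_max - w_min) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # guard constant groups
    q = np.clip(np.round((w - w_min) / scale), 0, 7).astype(np.uint8)
    return q, scale, w_min

def dequantize_3bit(q, scale, w_min, shape):
    """Reconstruct approximate weights from 3-bit codes."""
    return (q.astype(np.float32) * scale + w_min).reshape(shape)

# Round-trip check: rounding error is at most half a quantization step.
rng = np.random.default_rng(0)
w = rng.normal(size=(128, 128)).astype(np.float32)
q, s, m = quantize_3bit(w)
w_hat = dequantize_3bit(q, s, m, w.shape)
max_err = np.abs(w - w_hat).max()
```

Even this naive scheme bounds the per-weight error by half a step (`scale / 2`); hardware-aware recipes like those described in the abstract additionally shape how codes are packed and unpacked so that dequantization fuses cheaply into inference kernels.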