MBQ: Modality-Balanced Quantization for Large Vision-Language Models

📅 2024-12-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing post-training quantization (PTQ) methods overlook the heterogeneous quantization sensitivity of vision and language tokens in large vision-language models (VLMs), leading to substantial accuracy degradation. This work is the first to empirically reveal and characterize this modality-specific sensitivity. We propose Modality-Balanced Quantization (MBQ), a fine-tuning-free, plug-and-play PTQ framework whose calibration objective weights the reconstruction errors of vision and language tokens by their respective sensitivities, yielding better quantization parameters. MBQ supports low-bit settings such as W3 and W4A8 and is paired with a W3 GPU kernel that fuses dequantization and GEMV. Evaluated on VLMs ranging from 7B to 70B parameters, MBQ improves task accuracy by up to 4.4% (W3) and 11.6% (W4A8) over state-of-the-art PTQ baselines, and the fused kernel delivers a 1.4× inference speedup for LLaVA-OneVision-7B on an RTX 4090 GPU.
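
To make the modality-balanced calibration objective concrete, the sketch below shows one way such a loss could be written in PyTorch. This is a minimal illustration, not the authors' implementation: the function name, the `vision_mask` convention, and the sensitivity weights `s_vision`/`s_text` are assumptions for exposition.

```python
import torch

def modality_balanced_loss(fp_layer, q_layer, x, vision_mask,
                           s_vision=1.0, s_text=1.0):
    """Reconstruction loss that weights vision and language tokens
    differently (illustrative; sensitivities are assumed inputs).

    fp_layer    : full-precision linear layer
    q_layer     : quantized counterpart being calibrated
    x           : calibration activations, shape (num_tokens, hidden)
    vision_mask : bool tensor (num_tokens,), True for vision tokens
    """
    # Per-token squared error between full-precision and quantized outputs.
    err = (fp_layer(x) - q_layer(x)).pow(2).mean(dim=-1)  # (num_tokens,)
    # Balance the modalities instead of averaging all tokens uniformly,
    # so the more sensitive modality is not drowned out by the other.
    return s_vision * err[vision_mask].mean() + s_text * err[~vision_mask].mean()
```

During calibration, the quantization parameters of `q_layer` (e.g., per-channel scales) would then be searched or optimized to minimize this loss over a small multimodal calibration set.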

📝 Abstract
Vision-Language Models (VLMs) have enabled a variety of real-world applications. The large parameter size of VLMs brings large memory and computation overhead, which poses significant challenges for deployment. Post-Training Quantization (PTQ) is an effective technique to reduce the memory and computation overhead. Existing PTQ methods mainly focus on large language models (LLMs), without considering the differences across other modalities. In this paper, we discover that there is a significant difference in sensitivity between language and vision tokens in large VLMs. Therefore, treating tokens from different modalities equally, as in existing PTQ methods, may over-emphasize the insensitive modalities, leading to significant accuracy loss. To deal with the above issue, we propose a simple yet effective method, Modality-Balanced Quantization (MBQ), for large VLMs. Specifically, MBQ incorporates the different sensitivities across modalities during the calibration process to minimize the reconstruction loss for better quantization parameters. Extensive experiments show that MBQ can significantly improve task accuracy by up to 4.4% and 11.6% under W3 and W4A8 quantization for 7B to 70B VLMs, compared to SOTA baselines. Additionally, we implement a W3 GPU kernel that fuses the dequantization and GEMV operators, achieving a 1.4x speedup on LLaVA-OneVision-7B on the RTX 4090. The code is available at https://github.com/thu-nics/MBQ.
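
For intuition about the fused kernel, the reference sketch below reproduces in plain PyTorch the computation a fused W3 dequantization+GEMV kernel performs: unpack 3-bit weights, dequantize them, and multiply by the activation vector. The packing layout (ten 3-bit values per int32 word) and the per-channel scale/zero-point parameterization are assumptions for illustration; the actual kernel performs these steps on-chip, so the full-precision weight matrix is never materialized in GPU memory.

```python
import torch

def w3_dequant_gemv(packed_w, scales, zeros, x):
    """Reference (unoptimized) W3 dequantization + GEMV.

    packed_w : int32 tensor (out_features, ceil(in_features / 10)),
               each word holding ten 3-bit weights (assumed layout)
    scales   : float tensor (out_features,), per-channel scales
    zeros    : float tensor (out_features,), per-channel zero points
    x        : float tensor (in_features,), activation vector
    """
    out_features = packed_w.shape[0]
    in_features = x.shape[0]
    w_int = torch.empty(out_features, in_features, dtype=torch.int32)
    for j in range(in_features):
        word, slot = divmod(j, 10)  # ten 3-bit weights per int32 word
        w_int[:, j] = (packed_w[:, word] >> (3 * slot)) & 0x7
    # Dequantize with per-output-channel scale and zero point.
    w = (w_int.float() - zeros[:, None]) * scales[:, None]
    return w @ x  # GEMV: one output element per row of w
```

A real fused kernel interleaves the unpack/dequantize step with the dot products inside each thread block, trading a little extra arithmetic for a large reduction in memory traffic, which is the usual source of speedups for weight-quantized GEMV.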
Problem

Research questions and friction points this paper is trying to address.

Post-Training Quantization
Vision-Language Models
Accuracy Degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modality-Balanced Quantization (MBQ)
Vision-Language Model Optimization
Accuracy and Efficiency Enhancement