MGVQ: Synergizing Multi-dimensional Sensitivity-Aware and Gradient-Hessian Fusion for Vector Quantization

📅 2026-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vector quantization methods for vision-language models do not account for cross-modal weight distribution discrepancies, and their error compensation is biased because it neglects first-order gradient information. To address these limitations, this work proposes MGVQ, a framework that jointly models channel sensitivity and gradient-Hessian information for ultra-low-bit quantization. Its two core components are a mixed-precision allocation strategy guided by multi-dimensional sensitivity analysis and a second-order error compensation mechanism that embeds first-order gradients and relies on Kronecker and Block-LDL decompositions. Experiments on three vision-language model families (LLaVA-OneVision, InternVL2, and Qwen2-VL) report gains of up to 4.9 accuracy points at 2-bit quantization, with 71.4% on InternVL2-26B.
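To make the mixed-precision idea concrete, the following minimal sketch promotes the most sensitive channels to a higher bit-width under an average bit budget. It is an illustration only, not MGVQ's allocation rule: the function name `allocate_bits`, the two-level 2-bit/4-bit scheme, and the precomputed per-channel `sensitivities` scores are all assumptions, since the summary does not say how sensitivity is scored or which bit-widths are used.

```python
def allocate_bits(sensitivities, low_bits=2, high_bits=4, avg_budget=2.5):
    """Assign high_bits to the most sensitive channels without exceeding
    an average of avg_budget bits per channel (hypothetical two-level scheme)."""
    if high_bits <= low_bits:
        raise ValueError("high_bits must be greater than low_bits")
    n = len(sensitivities)
    # Largest number of channels that can be promoted within the budget.
    # The small epsilon guards against floating-point round-off.
    num_high = int((avg_budget - low_bits) * n / (high_bits - low_bits) + 1e-9)
    num_high = max(0, min(n, num_high))
    # Rank channel indices from most to least sensitive.
    order = sorted(range(n), key=lambda i: sensitivities[i], reverse=True)
    bits = [low_bits] * n
    for i in order[:num_high]:
        bits[i] = high_bits
    return bits


# Four channels with made-up sensitivity scores; one can be promoted to 4 bits.
print(allocate_bits([0.1, 0.9, 0.3, 0.5]))  # -> [2, 4, 2, 2]
```

A greedy ranking is used here only because it is the simplest rule that respects the budget; a structured scheme that combines global and local sensitivity, as the summary describes, would replace the single score list with its own criterion.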
📝 Abstract
Vision-Language Models (VLMs) achieve outstanding performance, yet their large size severely hinders deployment on resource-limited edge devices. Vector quantization (VQ) is an efficient model compression technique that excels at ultra-low-bit representation: it maps model weights to discrete codewords in a compact codebook, cutting memory consumption and transmission overhead while preserving model capability. Applying VQ directly to VLMs still has two core limitations. First, the weight distribution differences across modalities that arise from visual and textual inputs cannot be fitted well by a single unified codebook. Second, current second-order error compensation ignores first-order gradient information, which causes weights to deviate from their pre-trained optimal states and leads to gradient drift and biased compensation. This work proposes MGVQ, a vector quantization framework that integrates multi-dimensional sensitivity perception with gradient-Hessian fusion. It consists of two core modules. Sensitivity-guided structured mixed-precision quantization dynamically assigns different bit-widths according to channel sensitivity, combining global and local sensitivity analysis for finer-grained resource allocation. Gradient-aware second-order error compensation embeds first-order gradients into error correction and adopts Kronecker and Block-LDL decompositions to keep the computational cost low. Extensive experiments on mainstream VLMs, including LLaVA-OneVision, InternVL2, and Qwen2-VL, verify the effectiveness of MGVQ. In 2-bit quantization settings, MGVQ significantly surpasses existing advanced post-training quantization methods, achieving a maximum accuracy improvement of 4.9 points (71.4% vs 67.0% on InternVL2-26B). The proposed method delivers stable and efficient ultra-low-bit VLM quantization, supporting the practical deployment of multimodal large models in resource-limited environments.
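The codebook mapping described above can be sketched in a few lines. This is a generic k-means vector quantizer in plain Python, offered as a minimal illustration of how weights become codeword indices; it is not MGVQ's codebook construction. All function names (`split_vectors`, `fit_codebook`, `quantize`, `dequantize`, `bits_per_weight`) and the toy weights are assumptions introduced for the example.

```python
import math
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_codeword(vector, codebook):
    """Index of the codeword closest to `vector` in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def split_vectors(weights, dim):
    """Group a flat weight list into consecutive vectors of length `dim`."""
    return [weights[i:i + dim] for i in range(0, len(weights), dim)]


def fit_codebook(vectors, num_codewords, iterations=20, seed=0):
    """Plain k-means: alternate nearest-codeword assignment and mean update."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, num_codewords)]
    for _ in range(iterations):
        assignments = [nearest_codeword(v, codebook) for v in vectors]
        for k in range(num_codewords):
            members = [v for v, a in zip(vectors, assignments) if a == k]
            if members:  # keep the old codeword if its cluster is empty
                codebook[k] = [sum(m[d] for m in members) / len(members)
                               for d in range(len(members[0]))]
    return codebook


def quantize(vectors, codebook):
    return [nearest_codeword(v, codebook) for v in vectors]


def dequantize(indices, codebook):
    return [value for i in indices for value in codebook[i]]


def bits_per_weight(num_codewords, dim):
    """Index cost only; codebook storage is ignored."""
    return math.log2(num_codewords) / dim


weights = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0]  # toy data
vectors = split_vectors(weights, dim=2)
codebook = fit_codebook(vectors, num_codewords=2)
indices = quantize(vectors, codebook)
restored = dequantize(indices, codebook)
```

With 16 codewords over 2-dimensional vectors, `bits_per_weight(16, 2)` gives 2.0, which is how a codebook index translates into a "2-bit" setting. A single codebook fitted this way to all weights is exactly the configuration the abstract argues struggles when visual and textual weights follow different distributions.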
Problem

Research questions and friction points this paper is trying to address.

vector quantization
vision-language models
cross-modality weight distribution
gradient drift
error compensation
Innovation

Methods, ideas, or system contributions that make the work stand out.

vector quantization
sensitivity-aware
gradient-Hessian fusion
mixed-precision
vision-language models
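Why ignoring the gradient can bias compensation follows from a second-order Taylor expansion of the loss around the pre-trained weights. The block below is the standard textbook expansion, not a formula taken from the paper; the symbols $\mathcal{L}$, $w$, $\delta w$, $g$, $H$, $A$, and $B$ are notation assumed here.

```latex
% Loss change caused by a quantization perturbation \delta w
\Delta\mathcal{L}
  = \mathcal{L}(w + \delta w) - \mathcal{L}(w)
  \approx g^{\top}\delta w + \tfrac{1}{2}\,\delta w^{\top} H\,\delta w,
\qquad
g = \nabla_w \mathcal{L}(w), \quad H = \nabla_w^{2}\mathcal{L}(w).

% Hessian-only compensation assumes g = 0, leaving only the quadratic term
\Delta\mathcal{L}\big|_{g = 0} \approx \tfrac{1}{2}\,\delta w^{\top} H\,\delta w.

% If the Hessian is approximated by a Kronecker product, its inverse factorizes
H \approx A \otimes B
\quad\Longrightarrow\quad
H^{-1} \approx A^{-1} \otimes B^{-1}.
```

When $g \neq 0$, for example when the model is not exactly at a minimum on the calibration data, the linear term $g^{\top}\delta w$ does not vanish, which is consistent with the abstract's claim that Hessian-only compensation is biased. How MGVQ embeds $g$ and where it applies the Kronecker and Block-LDL factorizations is not specified in this listing.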