AI Summary
This work addresses two limitations of multimodal large language models (MLLMs) in visual question answering: overconfidence and poor deployment efficiency, both of which hinder practical use. Post-training quantization compresses the model, but it often compromises accuracy and reliability. We present the first systematic analysis of quantization's adverse effects on MLLM reliability and evaluate performance degradation across bit-widths for Qwen2-VL-7B and Idefics3-8B using both data-free (HQQ) and data-aware (MBQ) quantization methods. Furthermore, we adapt the Selector confidence estimator to quantized multimodal settings, substantially mitigating the reliability loss. Experiments show that int4 MBQ combined with the Selector comes close to the accuracy and reliability of the original model while reducing memory usage by approximately 75%, the best balance between efficiency and reliability among the configurations tested.
Abstract
Multimodal Large Language Models (MLLMs) are increasingly deployed in domains where both reliability and efficiency are critical. However, current models remain overconfident, producing highly certain but incorrect answers. At the same time, their large size limits deployment on edge devices, necessitating compression. We study the intersection of these two challenges by analyzing how Post-Training Quantization (PTQ) affects both accuracy and reliability in Visual Question Answering (VQA). We evaluate two MLLMs, Qwen2-VL-7B and Idefics3-8B, quantized with data-free (HQQ) and data-aware (MBQ) methods across multiple bit widths. To counteract the reduction in reliability caused by quantization, we adapt the Selector confidence estimator for quantized multimodal settings and test its robustness across various quantization levels and out-of-distribution (OOD) scenarios. We find that PTQ degrades both accuracy and reliability, that data-aware methods soften this degradation, and that the Selector substantially mitigates the reliability impact. The combination of int4 MBQ and the Selector achieves the best efficiency-reliability trade-off, closing in on uncompressed performance at approximately 75% less memory demand. Overall, we present the first systematic study linking quantization and reliability in multimodal settings.
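The figure of approximately 75% less memory is consistent with simple bit-width arithmetic. The sketch below is illustrative only and rests on assumptions not stated in the abstract: a 16-bit baseline (fp16 or bf16), a round 7 billion parameter count, and no overhead for quantization metadata such as scales and zero points. The function names are hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB, ignoring quantization metadata."""
    return num_params * bits_per_weight / 8 / 2**30


def memory_reduction(baseline_bits: float, quantized_bits: float) -> float:
    """Fraction of weight memory saved relative to the baseline precision."""
    return 1.0 - quantized_bits / baseline_bits


if __name__ == "__main__":
    params = 7e9  # assumed round parameter count, for illustration only
    baseline = weight_memory_gib(params, 16)  # assumed 16-bit baseline
    int4 = weight_memory_gib(params, 4)       # int4 weights, no metadata
    saved = memory_reduction(16, 4)
    print(f"16-bit: {baseline:.1f} GiB, int4: {int4:.1f} GiB, saved: {saved:.0%}")
```

In practice, per-group scales and zero points, and any layers kept at higher precision, add some overhead, so the real saving sits slightly below this idealized ratio.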