LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying multimodal large language models (MLLMs) incurs significant memory and computational overhead, and existing ultra-low-bit (<4-bit) quantization methods suffer severe performance degradation. Method: This paper proposes the first post-training ultra-low-bit quantization framework tailored for MLLMs. Its core innovations are: (i) identifying the inter-layer heterogeneity of activation distributions induced by multimodal tokens, which enables entropy-based layer-wise dynamic bit allocation; and (ii) a vision-language joint calibration mechanism that improves quantization robustness. Contribution/Results: Evaluated on LLaVA-1.5 and Qwen2.5-VL, the method achieves up to 40% memory compression while incurring less than a 10% accuracy drop on the MME benchmark relative to a 4-bit baseline, substantially outperforming prior ultra-low-bit approaches. This work establishes a new paradigm for efficient edge deployment of MLLMs.

📝 Abstract
Large Language Models (LLMs) with multimodal capabilities have revolutionized vision-language tasks, but their deployment often requires huge memory and computational resources. While post-training quantization (PTQ) has successfully compressed language models to as low as 1-bit precision without significant performance loss, its effectiveness for multimodal LLMs (MLLMs) remains relatively unexplored. In this paper, we present the first study on ultra-low bit (<4-bit) quantization for multimodal LLMs. Our analysis reveals that multimodal tokens, and the intermediate layer activations they produce, exhibit significantly higher statistical variance and entropy than text tokens, making them less tolerant to ultra-low bit quantization. However, the activation distributions of multimodal tokens vary significantly across layers, with some layers having lower-entropy activation distributions. We empirically show that such layers can better tolerate ultra-low bit quantization. Building on these insights, we propose a novel strategy for MLLM quantization, LUQ: Layerwise Ultra-Low Bit Quantization, which selectively applies ultra-low bit quantization to layers that are more resilient to it. We also show that using a mix of multimodal tokens (image and text) for PTQ boosts VQA performance in the ultra-low bit regime. We evaluate our method on LLaVA-1.5 and Qwen2.5-VL across 9 popular VQA benchmarks. The resulting LUQ models use 40% and 31% less memory than their 4-bit counterparts, respectively, while exhibiting a performance degradation of less than 10% on the MME benchmark.
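The abstract's core idea, scoring each layer's activation distribution by its entropy and reserving ultra-low bit-widths for the low-entropy (more quantization-tolerant) layers, can be sketched as below. The function names, the histogram bin count, and the entropy threshold are illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np

def activation_entropy(acts, num_bins=256):
    """Shannon entropy (in bits) of a layer's activation histogram."""
    hist, _ = np.histogram(acts, bins=num_bins)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins so log2 is well-defined
    return float(-np.sum(p * np.log2(p)))

def allocate_bits(layer_acts, threshold, low_bits=2, high_bits=4):
    """Assign the ultra-low bit-width to layers whose activation
    entropy falls below the threshold; keep the rest at higher precision."""
    return {name: low_bits if activation_entropy(a) < threshold else high_bits
            for name, a in layer_acts.items()}

# Toy calibration activations: a tightly peaked distribution (low entropy)
# vs. a spread-out one (high entropy).
rng = np.random.default_rng(0)
layer_acts = {
    "layer_0": rng.normal(0.0, 0.05, size=10_000),
    "layer_1": rng.uniform(-4.0, 4.0, size=10_000),
}
bits = allocate_bits(layer_acts, threshold=7.5)
# layer_0's peaked histogram yields lower entropy, so it receives 2 bits,
# while the near-uniform layer_1 stays at 4 bits.
```

In the paper's setting the activations would come from forward passes over a mixed image-and-text calibration set, which the authors show matters for VQA performance in the ultra-low bit regime.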
Problem

Research questions and friction points this paper is trying to address.

Compressing multimodal LLMs to ultra-low bit precision
Addressing high variance in multimodal token activations
Selectively applying quantization to resilient layers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layerwise ultra-low bit quantization for multimodal LLMs
Selectively applies ultra-low quantization to resilient layers
Uses mixed multimodal tokens to boost VQA performance