QAPruner: Quantization-Aware Vision Token Pruning for Multimodal Large Language Models

📅 2026-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that multimodal large language models suffer from exacerbated quantization errors when jointly applying low-bit quantization and visual token pruning, primarily due to coupling effects wherein conventional semantic pruning inadvertently removes activation outliers critical for numerical stability. To mitigate this, the authors propose the first framework that explicitly co-optimizes post-training quantization and visual token pruning. Central to this approach is a lightweight hybrid sensitivity metric that integrates simulated grouped quantization error, outlier magnitude, and semantic relevance to prioritize tokens essential for both semantic fidelity and quantization robustness. Evaluated on LLaVA, the method achieves a 2.24% accuracy gain over the baseline while retaining only 12.5% of visual tokens, even outperforming the dense quantized model without pruning.
📝 Abstract
Multimodal Large Language Models (MLLMs) have shown strong reasoning ability, but their high computational and memory costs hinder deployment in resource-constrained settings. While Post-Training Quantization (PTQ) and vision token pruning are standard compression techniques, they are usually treated as independent optimizations. In this paper, we show that these two techniques are strongly coupled: naively applying semantic-based token pruning to PTQ-optimized MLLMs can discard activation outliers that are important for numerical stability and thus worsen quantization errors in low-bit regimes (e.g., W4A4). To address this issue, we propose a quantization-aware vision token pruning framework. Our method introduces a lightweight hybrid sensitivity metric that combines simulated group-wise quantization error with outlier intensity. By combining this metric with standard semantic relevance scores, the method retains tokens that are both semantically informative and robust to quantization. Experiments on standard LLaVA architectures show that our method consistently outperforms naive integration baselines. At an aggressive pruning ratio that retains only 12.5% of visual tokens, our framework improves accuracy by 2.24% over the baseline and even surpasses dense quantization without pruning. To the best of our knowledge, this is the first method that explicitly co-optimizes vision token pruning and PTQ for accurate low-bit MLLM inference.
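The abstract describes a hybrid sensitivity metric combining simulated group-wise quantization error, outlier intensity, and semantic relevance. The paper does not give the exact formula here, so the sketch below is an illustrative approximation under assumed details: symmetric W4-style group quantization, per-token max magnitude as outlier intensity, a simple weighted sum of min-max-normalized terms (weights `alpha`, `beta`, `gamma` are hypothetical), and top-k selection at the 12.5% keep ratio mentioned in the abstract.

```python
import numpy as np

def simulated_group_quant_error(x, bits=4, group_size=32):
    """Per-token error from simulated symmetric group-wise quantization.

    x: (n_tokens, dim) visual token activations; dim must divide by group_size.
    """
    n_tokens, dim = x.shape
    xg = x.reshape(n_tokens, dim // group_size, group_size)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(xg).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1e-8, scale)  # avoid div-by-zero on all-zero groups
    q = np.clip(np.round(xg / scale), -qmax - 1, qmax)
    return np.abs(xg - q * scale).mean(axis=(1, 2))  # mean abs error per token

def hybrid_sensitivity(x, semantic_scores, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of quantization error, outlier intensity, semantic relevance.

    semantic_scores: (n_tokens,) e.g. attention-based relevance; weights are
    illustrative, not the paper's calibrated values.
    """
    def norm(v):  # min-max normalize so the three terms are comparable
        return (v - v.min()) / (v.max() - v.min() + 1e-8)
    quant_err = simulated_group_quant_error(x)
    outlier = np.abs(x).max(axis=-1)  # per-token outlier magnitude
    return (alpha * norm(quant_err)
            + beta * norm(outlier)
            + gamma * norm(semantic_scores))

def prune_tokens(x, semantic_scores, keep_ratio=0.125):
    """Return sorted indices of the tokens to keep (highest hybrid scores)."""
    scores = hybrid_sensitivity(x, semantic_scores)
    k = max(1, int(round(keep_ratio * x.shape[0])))
    return np.sort(np.argsort(scores)[-k:])
```

Under this scoring, a token carrying a large activation outlier scores high on both the quantization-error and outlier terms, so it survives pruning even when its semantic relevance is modest, which matches the coupling effect the paper highlights.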
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Post-Training Quantization
Vision Token Pruning
Quantization Error
Low-bit Inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantization-Aware Pruning
Vision Token Pruning
Post-Training Quantization
Multimodal Large Language Models
Outlier Preservation
Xinhao Wang
Wangxuan Institute of Computer Technology, Peking University
Zhongyu Xia
Wangxuan Institute of Computer Technology, Peking University
Zhiwei Lin
Peking University
3D perception · open-world perception · self-supervised learning · autonomous driving
Zhe Li
Rochester Institute of Technology
Distributed Machine Learning · Optimization
Yongtao Wang
Wangxuan Institute of Computer Technology, Peking University