Towards Joint Quantization and Token Pruning of Vision-Language Models

📅 2026-04-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets the high inference cost of vision-language models under ultra-low-bit settings such as W4A4 (4-bit weights and 4-bit activations), a cost driven mainly by long visual-token prefixes and growing KV caches. Quantization and token pruning are usually optimized separately, and stage-wise combinations are often brittle because quantization calibration and pruning execution are mismatched. The paper introduces QUOTA, a framework that handles quantization and deterministic visual-token pruning jointly in a single pipeline. QUOTA converts quantization calibration signals into a layer-wise token allocation schedule and guides top-k token selection under a fixed budget using activation magnitudes, attention cues, and a low-bit risk signal. With only 30% of visual tokens retained, QUOTA reaches 95.65% average retention on standard benchmarks, compared with about 94.3% for representative stage-wise baselines.
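To make the W4A4 setting concrete, the following is a minimal sketch of symmetric per-tensor 4-bit quantize-dequantize in plain Python. It is an illustration only, not the paper's quantizer: the function names `fake_quantize_int4` and `quantization_error` are hypothetical, and the choice of a symmetric per-tensor scale is an assumption made here for simplicity.

```python
def fake_quantize_int4(values):
    """Quantize floats to signed 4-bit levels, then map them back to floats.

    Assumption: symmetric per-tensor scaling. Real low-bit operators often
    use per-channel or per-group scales instead.
    """
    qmin, qmax = -8, 7  # range of a signed 4-bit integer
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return [0.0] * len(values)
    scale = max_abs / qmax
    dequantized = []
    for v in values:
        q = round(v / scale)          # nearest integer level
        q = max(qmin, min(qmax, q))   # clamp into the 4-bit range
        dequantized.append(q * scale)
    return dequantized


def quantization_error(values):
    """Mean squared error introduced by the 4-bit round trip."""
    if not values:
        raise ValueError("values must be non-empty")
    deq = fake_quantize_int4(values)
    return sum((a - b) ** 2 for a, b in zip(values, deq)) / len(values)


if __name__ == "__main__":
    activations = [0.02, -1.3, 0.75, 4.1, -0.4]
    print(fake_quantize_int4(activations))
    print(quantization_error(activations))
```

A per-token error of this kind is one plausible way to think about a "low-bit risk" indicator, although the summary does not specify how QUOTA actually computes it.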

📝 Abstract

Deploying Vision-Language Models (VLMs) under aggressive low-bit inference remains challenging because inference cost is dominated by the long visual-token prefix during prefill and the growing KV cache during autoregressive decoding. Token pruning and low-bit quantization are complementary for reducing these costs, yet naive stage-wise combinations are often brittle due to a mismatch between quantization calibration and pruning execution. We present a collaborative quantization-and-pruning framework that unifies low-bit inference and deterministic visual-token pruning in a single deployable pipeline. The framework introduces the Quantization Unified Offline Token Allocator (QUOTA), which converts low-bit calibration signals into a layer-wise token allocation schedule and materializes it as a pruning recipe. Token importance is evaluated under deployed W4A4 operators with a quantized KV cache by combining activation magnitude, attention cues, and an explicit low-bit risk signal, enabling consistent budgeted top-k selection. Experiments on standard VLM benchmarks show improved robustness over stage-wise baselines under the same low-bit regime, achieving 95.65% average retention while retaining only 30% of visual tokens, compared with about 94.3% retention for representative stage-wise combinations. The code will be released.
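The budgeted top-k selection described in the abstract can be sketched as follows. This is a hedged illustration under stated assumptions, not the QUOTA implementation: the function names, the min-max normalization, the equal default weights, and in particular the choice to subtract the risk signal (so that high-risk tokens are less likely to be kept) are all assumptions introduced here. The abstract does not specify how the three signals are combined.

```python
def min_max_normalize(xs):
    """Rescale a list to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0] * len(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def select_tokens(act_mag, attn, risk, keep_ratio=0.3,
                  w_act=1.0, w_attn=1.0, w_risk=1.0):
    """Return sorted indices of the visual tokens to keep in one layer.

    Hypothetical scoring rule (an assumption, not from the paper):
        score = w_act * act + w_attn * attn - w_risk * risk
    with each signal min-max normalized over the tokens of the layer.
    """
    n = len(act_mag)
    if not (len(attn) == n and len(risk) == n):
        raise ValueError("all signals must have one entry per token")
    if n == 0:
        return []
    k = min(n, max(1, round(keep_ratio * n)))  # fixed per-layer budget

    act_n = min_max_normalize(act_mag)
    attn_n = min_max_normalize(attn)
    risk_n = min_max_normalize(risk)
    scores = [w_act * a + w_attn * t - w_risk * r
              for a, t, r in zip(act_n, attn_n, risk_n)]

    # Deterministic: ties are broken by the lower token index.
    ranked = sorted(range(n), key=lambda i: (-scores[i], i))
    return sorted(ranked[:k])  # keep the original token order


if __name__ == "__main__":
    act = [0.1, 0.9, 0.5, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.0]
    attn = [0.2] * 10
    risk = [0.0] * 10
    print(select_tokens(act, attn, risk, keep_ratio=0.3))
```

Because the ranking uses a fixed tie-break and no randomness, the same inputs always produce the same kept indices, which is what allows a selection of this kind to be stored offline as a pruning recipe. A layer-wise schedule would correspond to calling the function with a different `keep_ratio` per layer.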

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models

Low-bit Quantization

Token Pruning

KV Cache

Inference Efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint Quantization and Pruning

Low-bit Inference

Token Pruning