🤖 AI Summary
This work addresses the cost of vision-language model (VLM) inference under aggressive low-bit settings (e.g., W4A4), which is dominated by the long visual-token prefix during prefill and the growing KV cache during autoregressive decoding. Quantization and token pruning are usually applied as separate stages, and such combinations are often brittle because quantization calibration does not match pruning execution. To overcome this, the paper introduces QUOTA (Quantization Unified Offline Token Allocator), which unifies low-bit inference and deterministic visual-token pruning in a single deployable pipeline. QUOTA converts low-bit calibration signals into a layer-wise token allocation schedule and performs budgeted top-k token selection by combining activation magnitude, attention cues, and an explicit low-bit risk signal, all evaluated under the deployed W4A4 operators with a quantized KV cache. On standard VLM benchmarks, QUOTA achieves 95.65% average retention while keeping only 30% of visual tokens, compared with about 94.3% for representative stage-wise combinations under the same low-bit regime.
📝 Abstract
Deploying Vision-Language Models (VLMs) under aggressive low-bit inference remains challenging because inference cost is dominated by the long visual-token prefix during prefill and the growing KV cache during autoregressive decoding. Token pruning and low-bit quantization are complementary for reducing these costs, yet naive stage-wise combinations are often brittle due to a mismatch between quantization calibration and pruning execution. We present a collaborative quantization-and-pruning framework that unifies low-bit inference and deterministic visual-token pruning in a single deployable pipeline. The framework introduces the \textbf{Q}uantization \textbf{U}nified \textbf{O}ffline \textbf{T}oken \textbf{A}llocator (\textbf{QUOTA}), which converts low-bit calibration signals into a layer-wise token allocation schedule and materializes it as a pruning recipe. Token importance is evaluated under deployed W4A4 operators with a quantized KV cache by combining activation magnitude, attention cues, and an explicit low-bit risk signal, enabling consistent budgeted top-$k$ selection. Experiments on standard VLM benchmarks show improved robustness over stage-wise baselines under the same low-bit regime, achieving 95.65\% average retention while retaining only 30\% of visual tokens, compared with about 94.3\% retention for representative stage-wise combinations. The code will be released.
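As a rough illustration of the budgeted top-$k$ selection described above, the following minimal sketch ranks visual tokens by a combined score and keeps a fixed fraction of them. It is not the authors' implementation: the function names, the min-max normalization, the linear weighting, and the choice to treat the low-bit risk signal as a penalty are all assumptions made for this example, since the abstract does not specify how the three signals are combined.

```python
import math
from typing import List, Sequence


def normalize(values: Sequence[float]) -> List[float]:
    """Min-max scale values to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_tokens(
    act_mag: Sequence[float],
    attn: Sequence[float],
    risk: Sequence[float],
    keep_ratio: float = 0.3,
    w_act: float = 1.0,
    w_attn: float = 1.0,
    w_risk: float = 1.0,
) -> List[int]:
    """Return the indices of the visual tokens to keep, in original order.

    Hypothetical scoring rule (an assumption, not taken from the paper):
        score = w_act * act + w_attn * attn - w_risk * risk
    where each signal is min-max normalized over the tokens of one layer.
    """
    n = len(act_mag)
    if len(attn) != n or len(risk) != n:
        raise ValueError("all three signals must have one value per token")
    if n == 0:
        return []

    # Fixed budget: keep floor(keep_ratio * n) tokens, but at least one.
    k = max(1, math.floor(keep_ratio * n))

    a, t, r = normalize(act_mag), normalize(attn), normalize(risk)
    scores = [w_act * a[i] + w_attn * t[i] - w_risk * r[i] for i in range(n)]

    # Deterministic selection: sort by descending score, break ties by index.
    ranked = sorted(range(n), key=lambda i: (-scores[i], i))
    return sorted(ranked[:k])
```

With 576 visual tokens (a hypothetical count chosen only for the arithmetic) and a keep ratio of 0.3, this budget rule keeps 172 tokens. Breaking ties by index makes the selection reproducible across runs, which is one simple way to obtain the determinism the abstract mentions.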