WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work addresses the high inference latency and substantial GPU memory consumption of video language models (VLMs), which stem from excessively long visual token sequences. While existing token-level mixed-precision KV cache quantization methods aim to mitigate these issues, they suffer from large search overhead and suboptimal hardware efficiency. To overcome these limitations, we propose WindowQuant, a novel window-level adaptive mixed-precision quantization framework that rapidly determines the optimal bit-width for each visual token window based on its similarity to the textual prompt. Furthermore, WindowQuant enhances hardware utilization through KV cache window reordering. Experimental results demonstrate that WindowQuant consistently outperforms state-of-the-art approaches across multiple benchmarks, achieving significant reductions in both inference latency and memory footprint with negligible accuracy degradation.
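The summary does not specify how the similarity score or the mapping to bit-widths is computed, so the following is only a minimal sketch of the general idea of window-level bit-width selection. It assumes mean-pooled embeddings, cosine similarity, and an even rank-based split across candidate bit-widths. The function `assign_window_bits`, the `bit_choices` values, and the pooling scheme are all hypothetical and are not taken from the paper.

```python
import numpy as np

def assign_window_bits(visual_tokens, text_tokens, window_size, bit_choices=(2, 4, 8)):
    """Assign one bit-width per window of visual tokens (illustrative only).

    visual_tokens: (num_visual_tokens, dim) array of visual token embeddings.
    text_tokens:   (num_text_tokens, dim) array of prompt token embeddings.
    """
    num_windows = visual_tokens.shape[0] // window_size
    # Simplification: drop trailing tokens that do not fill a whole window.
    windows = visual_tokens[: num_windows * window_size]
    windows = windows.reshape(num_windows, window_size, -1)

    # Assumption: one embedding per window and one for the prompt via mean pooling.
    window_emb = windows.mean(axis=1)
    text_emb = text_tokens.mean(axis=0)

    # Cosine similarity between each window and the prompt.
    norms = np.linalg.norm(window_emb, axis=1) * np.linalg.norm(text_emb)
    sims = (window_emb @ text_emb) / (norms + 1e-8)

    # Assumption: rank windows by similarity and split the ranking evenly,
    # so the least similar windows get the fewest bits.
    order = np.argsort(sims)
    bits = np.empty(num_windows, dtype=np.int64)
    for b, idx in zip(sorted(bit_choices), np.array_split(order, len(bit_choices))):
        bits[idx] = b
    return sims, bits
```

Because the decision is made once per window rather than once per token, the number of similarity scores to compute and rank shrinks by a factor of `window_size`, which is one way to read the reduced search overhead described above.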

📝 Abstract

Recently, video language models (VLMs) have been applied in various fields. However, the visual token sequence of a VLM is very long, which can cause intolerable inference latency and GPU memory usage. Existing methods apply mixed-precision quantization to the key-value (KV) cache in VLMs at token granularity, which makes the search process time-consuming and the computation hardware-inefficient. This paper introduces a novel approach called WindowQuant, which employs window-adaptive mixed-precision quantization to optimize the KV cache. WindowQuant consists of two modules: window-level quantization search and window-level KV cache computation. Window-level quantization search quickly determines the optimal bit-width configuration of the KV cache windows based on the similarity scores between the corresponding visual token windows and the text prompt, while maintaining model accuracy. Furthermore, window-level KV cache computation reorders the KV cache windows before quantization, avoiding the hardware inefficiency that mixed-precision quantization causes during inference. Extensive experiments demonstrate that WindowQuant outperforms state-of-the-art VLMs and KV cache quantization methods on various datasets.
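The abstract does not describe the reordering step in detail. One plausible reading is that windows sharing a bit-width are gathered into contiguous groups, so that each group can be quantized and processed uniformly. The sketch below illustrates that reading only. The helpers `quantize_uniform`, `reorder_and_quantize`, and `dequantize` are hypothetical, and the per-window min/max uniform quantizer is a stand-in, not the paper's scheme.

```python
import numpy as np

def quantize_uniform(x, bits):
    """Per-window min/max uniform quantization (stand-in quantizer, bits <= 8)."""
    levels = 2 ** bits - 1
    lo = x.min(axis=(1, 2), keepdims=True)
    hi = x.max(axis=(1, 2), keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def reorder_and_quantize(kv_windows, bits):
    """Group windows by bit-width, then quantize each group in one call.

    kv_windows: (num_windows, window_size, dim) array of cached keys or values.
    bits:       (num_windows,) integer array, one bit-width per window.
    """
    # Stable sort keeps the original order of windows within each group.
    perm = np.argsort(bits, kind="stable")
    sorted_bits = bits[perm]
    groups = {}
    for b in np.unique(sorted_bits):
        idx = perm[sorted_bits == b]  # original positions of this group
        groups[int(b)] = (idx, *quantize_uniform(kv_windows[idx], int(b)))
    return perm, groups

def dequantize(groups, shape):
    """Restore an approximate cache in the original window order."""
    out = np.empty(shape, dtype=np.float32)
    for idx, q, scale, lo in groups.values():
        out[idx] = q * scale + lo
    return out
```

Keeping the original indices alongside each group lets the cache be restored to its original window order after dequantization, while each group remains a dense, single-precision block during computation.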

Problem

Research questions and friction points this paper is trying to address.

video language models

KV cache quantization

inference latency

GPU memory usage

hardware efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

mixed-precision quantization

KV cache

window-level similarity