🤖 AI Summary
Visual key-value (KV) caching in multimodal large language models (MLLMs) incurs excessive GPU memory overhead during inference. Method: We propose the first near-lossless 1-bit quantization method designed specifically for visual KV caches. The approach combines grouped quantization with a quantile-driven strategy to accommodate the non-uniform distribution of KV tensors, enabling token-level visual KV compression without modifying the model architecture or discarding any visual tokens. Contribution/Results: Evaluated across multiple mainstream MLLMs, the method reduces visual KV memory consumption by over 90% on average, improves inference throughput by up to 2.3×, and degrades multimodal task accuracy by less than 0.5%. It is plug-and-play and generalizes well across diverse architectures and tasks.
📝 Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable success across various applications, yet their computational overhead during deployment remains a critical challenge. While Key-Value (KV) caching improves inference efficiency by trading memory for computation, the growing memory footprint from storing extensive KV caches reduces throughput and limits long-term execution on devices with constrained GPU memory. Existing approaches primarily focus on dropping unimportant tokens to reduce the KV cache size, mitigating memory constraints at the cost of potential information loss. In contrast, we propose a simple yet effective visual quantization strategy that preserves all visual tokens while significantly reducing memory consumption. To achieve an extreme quantization ratio, i.e., 1-bit quantization, we propose group-specific quantization and quantile-based quantization approaches, motivated by the inherent patterns of the KV cache. Our method is plug-and-play, enabling seamless integration into various MLLMs to improve memory efficiency without architectural modifications. Extensive experiments demonstrate that our approach effectively reduces memory overhead while maintaining computational efficiency and preserving multimodal performance.
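To make the idea concrete, the following is a minimal sketch of 1-bit quantile-based grouped quantization, as one might apply it per group of KV-cache values. This is an illustrative reconstruction, not the paper's exact algorithm: the group size, the choice of the median (0.5 quantile) as the split point, and the use of per-group conditional means as the two reconstruction levels are all assumptions for the sake of the example.

```python
import numpy as np

def quantize_1bit_grouped(x, group_size=32):
    """Illustrative 1-bit quantile-based grouped quantization (assumed scheme).

    Each group of `group_size` values is split at its median (the 0.5
    quantile). Every value is stored as one bit, plus two float
    reconstruction levels per group (the mean of each side of the split).
    """
    x = np.asarray(x, dtype=np.float32).reshape(-1, group_size)
    thresh = np.median(x, axis=1, keepdims=True)   # per-group quantile split
    bits = x > thresh                              # 1 bit per cached value
    # Per-group reconstruction levels: conditional mean of each side.
    lo_level = np.nanmean(np.where(bits, np.nan, x), axis=1, keepdims=True)
    hi_level = np.nanmean(np.where(bits, x, np.nan), axis=1, keepdims=True)
    return bits, lo_level, hi_level

def dequantize_1bit_grouped(bits, lo_level, hi_level):
    """Reconstruct each value from its bit and its group's two levels."""
    return np.where(bits, hi_level, lo_level)
```

With a group size of 32, the storage cost is 1 bit per value plus two floats per group, which is roughly a 94% reduction versus 16-bit storage — consistent in spirit with the >90% memory savings reported in the summary. Splitting at a quantile rather than using a uniform min-max range is what accommodates the non-uniform value distribution of KV tensors.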