🤖 AI Summary
Visual key-value (KV) caching in multimodal large language models (MLLMs) incurs excessive GPU memory overhead during inference. Method: We propose the first near-lossless 1-bit quantization method designed specifically for visual KV caches. The approach combines grouped quantization with a quantile-driven strategy to accommodate the non-uniform distribution of KV tensors, enabling token-level visual KV compression without modifying the model architecture or discarding any visual tokens. Contribution/Results: Evaluated across multiple mainstream MLLMs, the method reduces visual KV memory consumption by over 90% on average, improves inference throughput by up to 2.3×, and degrades multimodal task accuracy by less than 0.5%. It is plug-and-play and generalizes well across diverse architectures and tasks.
📝 Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable success across various applications, yet their computational overhead during deployment remains a critical challenge. While Key-Value (KV) caching improves inference efficiency by trading memory for computation, the growing memory footprint from storing extensive KV caches reduces throughput and limits long-term execution on devices with constrained GPU memory. Existing approaches primarily focus on dropping unimportant tokens to reduce the KV cache size, mitigating memory constraints at the cost of potential information loss. In contrast, we propose a simple yet effective visual quantization strategy that preserves all visual tokens while significantly reducing memory consumption. To achieve an extreme quantization ratio, i.e., 1-bit quantization, we propose group-specific quantization and quantile-based quantization approaches, motivated by the inherent patterns of the KV cache. Our method is plug-and-play, enabling seamless integration into various MLLMs to improve memory efficiency without architectural modifications. Extensive experiments demonstrate that our approach effectively reduces memory overhead while maintaining computational efficiency and preserving multimodal performance.
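To make the idea concrete, the following is a minimal sketch of 1-bit quantile-based grouped quantization, as one might apply it per group of KV-cache values. This is an illustrative reconstruction, not the paper's exact algorithm: the group size, the choice of the median (0.5 quantile) as the split point, and the use of per-group conditional means as the two reconstruction levels are all assumptions for the sake of the example.

```python
import numpy as np

def quantize_1bit_grouped(x, group_size=32):
    """Illustrative 1-bit quantile-based grouped quantization (assumed scheme).

    Each group of `group_size` values is split at its median (the 0.5
    quantile). Every value is stored as one bit, plus two float
    reconstruction levels per group (the mean of each side of the split).
    """
    x = np.asarray(x, dtype=np.float32).reshape(-1, group_size)
    thresh = np.median(x, axis=1, keepdims=True)   # per-group quantile split
    bits = x > thresh                              # 1 bit per cached value
    # Per-group reconstruction levels: conditional mean of each side.
    lo_level = np.nanmean(np.where(bits, np.nan, x), axis=1, keepdims=True)
    hi_level = np.nanmean(np.where(bits, x, np.nan), axis=1, keepdims=True)
    return bits, lo_level, hi_level

def dequantize_1bit_grouped(bits, lo_level, hi_level):
    """Reconstruct each value from its bit and its group's two levels."""
    return np.where(bits, hi_level, lo_level)
```

With a group size of 32, the storage cost is 1 bit per value plus two floats per group, which is roughly a 94% reduction versus 16-bit storage — consistent in spirit with the >90% memory savings reported in the summary. Splitting at a quantile rather than using a uniform min-max range is what accommodates the non-uniform value distribution of KV tensors.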