H1B-KV: Hybrid One-Bit Caches for Memory-Efficient Large Language Model Inference

📅 2025-10-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
KV caches dominate memory during autoregressive decoding of large language models (LLMs), and existing compression methods struggle to achieve completeness and efficiency simultaneously. To address this, the paper proposes the first end-to-end minimalist KV compression scheme. It encodes key vectors with 1-bit binary sketches, couples them with 4-bit quantization of value vectors, and accelerates attention computation via bit-level operations; lightweight fine-tuning preserves accuracy. Crucially, no historical context is discarded, enabling hardware-friendly, high-fidelity compression. Experiments show that for a 7B model with an 8K-token context window, the KV cache shrinks to under 60 MB, a 70x compression ratio, while maintaining full-precision performance on the GSM8K, MMLU, and HumanEval benchmarks. The method significantly outperforms state-of-the-art approaches such as Loki.

📝 Abstract
Autoregressive decoding in large language models (LLMs) requires caching a growing list of past key-value (KV) pairs, making long-context inference a memory-bound problem. While recent methods have explored quantizing the cache, evicting tokens, or using binary sketches for keys (e.g., Loki), these approaches often provide an incomplete solution by leaving one component (like values) uncompressed or by discarding context information. This paper introduces the Hybrid One-Bit KV Cache (H1B-KV), a comprehensive compression scheme that radically reduces memory usage without sacrificing context. H1B-KV represents each key vector using a 1-bit binary sketch, enabling hardware-friendly bitwise attention, and further compresses value vectors using 4-bit quantization. This holistic, hybrid approach allows a 7-billion parameter LLM to handle an 8k-token context with under 60 MB of cache memory, a 70x reduction. We demonstrate that after lightweight fine-tuning, H1B-KV matches full-precision performance not only on perplexity benchmarks but also on complex downstream tasks like mathematical reasoning (GSM8K), multi-task understanding (MMLU), and code generation (HumanEval). Our results show H1B-KV significantly outperforms leading quantization (KIVI), token eviction (SparseLLM), and key-only sketching (Loki) methods in quality-per-byte, establishing it as a robust solution for deploying LLMs in memory-constrained environments.
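The headline numbers can be sanity-checked with back-of-the-envelope arithmetic. Assuming a typical 7B configuration (32 layers, hidden size 4096; the paper's exact model dimensions are not stated here), an uncompressed fp16 KV cache at 8k tokens comes to about 4 GiB, and dividing by the reported 70x ratio lands just under the quoted 60 MB:

```python
# Baseline fp16 KV-cache size for an ASSUMED 7B config
# (32 layers, hidden size 4096) at an 8k-token context.
layers, hidden, tokens = 32, 4096, 8192
bytes_fp16 = 2
baseline = 2 * layers * tokens * hidden * bytes_fp16  # K and V tensors

print(baseline / 2**30)       # 4.0 (GiB, uncompressed)
print(baseline / 70 / 2**20)  # ~58.5 (MiB at the reported 70x ratio)
```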
Problem

Research questions and friction points this paper is trying to address.

Reducing memory usage for large language model inference caching
Compressing key-value pairs without sacrificing context information
Enabling long-context inference in memory-constrained environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid 1-bit binary sketches for key vectors
4-bit quantization for value vector compression
Hardware-friendly bitwise attention computation
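The three ingredients above can be illustrated with a minimal NumPy sketch. This is not the paper's algorithm: the SimHash-style sign-of-random-projection sketch, the sketch width `m`, the min/max 4-bit quantizer, and the bit-agreement score standing in for a dot product are all assumptions made for illustration. In hardware, the score below would be an XNOR plus popcount over packed bits.

```python
import numpy as np

rng = np.random.default_rng(0)

def sketch_key(k, proj):
    # 1-bit binary sketch: sign of a random projection (SimHash-style).
    # `proj` is a hypothetical d x m matrix; the paper's transform may differ.
    return (k @ proj > 0).astype(np.uint8)  # m bits per key

def quantize_value_4bit(v):
    # Uniform 4-bit (16-level) quantization with per-vector min/max scaling.
    lo, hi = v.min(), v.max()
    scale = (hi - lo) / 15 or 1.0  # guard against constant vectors
    q = np.clip(np.round((v - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize_value(q, scale, lo):
    return q.astype(np.float32) * scale + lo

def bitwise_score(q_bits, k_bits):
    # Bit-agreement score between query and key sketches, mapped to
    # [-m, m]; in hardware this is XNOR + popcount over packed words.
    matches = int(np.sum(q_bits == k_bits))
    return 2 * matches - len(q_bits)

d, m = 64, 256                      # head dim, sketch bits (assumed)
proj = rng.standard_normal((d, m))

q = rng.standard_normal(d)
keys = rng.standard_normal((8, d))
key_sketches = [sketch_key(k, proj) for k in keys]

q_bits = sketch_key(q, proj)
scores = np.array([bitwise_score(q_bits, kb) for kb in key_sketches])
# `scores` correlates with the true dot products q @ k, so it can rank
# cached tokens for attention while storing only 1 bit per sketch dim.
```

The values stay dequantizable to within half a quantization step, which is the property the 4-bit half of the hybrid cache relies on.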