🤖 AI Summary
This work addresses the substantial memory and bandwidth overhead of the KV cache in on-device long-context inference with large language models by proposing an adaptive KV-cache quantization method. Inspired by Huffman coding, it introduces a token-importance-aware variable-bit allocation mechanism in which a lightweight controller dynamically selects among {2-bit, 4-bit, 8-bit, FP16} precisions during decoding. The controller leverages low-overhead features, such as token frequency, quality scores, attention variance, and entropy, to enable efficient real-time decisions. Evaluated on the SmolLM model family, the approach substantially outperforms static and rule-based baselines: for instance, on SmolLM-360M with HellaSwag, it reduces decoding latency by 17.75% and improves accuracy by 7.60 points over static quantization, trailing FP16 by only 0.30 points.
📝 Abstract
Large Language Models (LLMs) have achieved remarkable progress across reasoning, generation, and decision-making tasks, yet deploying them on mobile, embedded, and edge devices remains particularly challenging. On-device LLM inference is heavily constrained by the memory and bandwidth overhead of the key-value (KV) cache, which grows linearly with context length and often dominates decoding cost. Existing KV-cache quantization schemes typically rely on fixed precision or hand-crafted heuristics, wasting bits on low-impact tokens while over-compressing informative ones and incurring avoidable accuracy degradation. Inspired by Huffman coding's principle of variable-length allocation, we propose adaptive KV-cache quantization, a learned policy that assigns bit-width in proportion to token importance, minimizing expected memory and latency without sacrificing accuracy. Our framework extracts lightweight token-level features, including token frequency, quality score, attention variance, and entropy-based uncertainty, and feeds them into a compact data-driven controller that dynamically selects KV precision from {2-bit, 4-bit, 8-bit, FP16} during decoding. This adaptive precision policy reduces KV memory footprint and latency, outperforms static KV quantization and rule-based baselines, and maintains accuracy close to FP16 inference across standard LLM benchmarks. Extensive experiments on multiple commonsense reasoning benchmarks with SmolLM-135M, SmolLM-360M, and SmolLM-1.7B show that our controller consistently improves the accuracy-latency trade-off. For instance, with SmolLM-360M on HellaSwag, our method reduces decoding latency (ms/token) by 17.75% relative to static KV quantization, improves accuracy by 7.60 points, and remains within 0.30 points of FP16 inference.
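The decoding-time loop the abstract describes, extract cheap per-token features, pick a precision, then compress that token's KV entry, can be sketched as follows. This is a minimal illustration: the feature weights, decision thresholds, and symmetric uniform quantizer are assumptions chosen for clarity, whereas the paper's controller is a learned, data-driven model, not hand-set rules.

```python
# Illustrative sketch of a token-importance-aware KV precision controller.
# All weights and thresholds below are hypothetical, for exposition only.

def importance_score(freq, attn_var, entropy):
    """Combine normalized [0, 1] token features into one importance score.
    Rare, high-variance, high-uncertainty tokens are treated as more important."""
    return 0.4 * (1.0 - freq) + 0.3 * attn_var + 0.3 * entropy

def select_bits(score):
    """Map importance to a KV precision from {2, 4, 8, 16 (FP16)}."""
    if score < 0.25:
        return 2
    if score < 0.50:
        return 4
    if score < 0.75:
        return 8
    return 16  # keep full FP16 precision for the most important tokens

def quantize_dequantize(vec, bits):
    """Symmetric uniform quantization of one KV vector at the chosen width."""
    if bits == 16:
        return list(vec)  # FP16 path: stored uncompressed
    max_abs = max(abs(x) for x in vec)
    if max_abs == 0.0:
        return list(vec)
    scale = max_abs / (2 ** (bits - 1) - 1)
    return [round(x / scale) * scale for x in vec]

def controller_step(freq, attn_var, entropy, kv_vec):
    """One decoding step: score the token, pick precision, compress its KV."""
    bits = select_bits(importance_score(freq, attn_var, entropy))
    return bits, quantize_dequantize(kv_vec, bits)
```

In this sketch, a frequent, low-variance token yields a low score and is stored at 2 bits, while a rare, high-uncertainty token keeps FP16, mirroring Huffman coding's idea of spending fewer bits on less informative symbols.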