🤖 AI Summary
This work addresses the substantial memory and bandwidth overhead of the KV cache in on-device long-context inference with large language models by proposing an adaptive KV-cache quantization method. Inspired by Huffman coding, it introduces a token-importance-aware variable-bit allocation mechanism in which a lightweight controller dynamically selects among {2-bit, 4-bit, 8-bit, FP16} precisions during decoding. The controller leverages low-overhead features, such as token frequency, quality scores, attention variance, and entropy, to enable efficient real-time decisions. Evaluated on the SmolLM model family, the approach substantially outperforms static and rule-based baselines: for instance, on SmolLM-360M with HellaSwag, it reduces decoding latency by 17.75% and improves accuracy by 7.60 points over static quantization, trailing FP16 by only 0.30 points.
📝 Abstract
Large Language Models (LLMs) have achieved remarkable progress across reasoning, generation, and decision-making tasks, yet deploying them on mobile, embedded, and edge devices remains particularly challenging. On-device LLM inference is heavily constrained by the memory and bandwidth overhead of the key-value (KV) cache, which grows linearly with context length and often dominates decoding cost. Existing KV-cache quantization schemes typically rely on fixed precision or hand-crafted heuristics, wasting bits on low-impact tokens while over-compressing informative ones and incurring avoidable accuracy degradation. Inspired by Huffman coding's principle of variable-length allocation, we propose adaptive KV-cache quantization, a learned policy that assigns bit-width in proportion to token importance, minimizing expected memory and latency without sacrificing accuracy. Our framework extracts lightweight token-level features, including token frequency, quality score, attention variance, and entropy-based uncertainty, and feeds them into a compact data-driven controller that dynamically selects KV precision from {2-bit, 4-bit, 8-bit, FP16} during decoding. This adaptive precision policy reduces KV memory footprint and latency, outperforms static KV quantization and rule-based baselines, and maintains accuracy close to FP16 inference across standard LLM benchmarks. Extensive experiments on multiple commonsense reasoning benchmarks with SmolLM-135M, SmolLM-360M, and SmolLM-1.7B show that our controller consistently improves the accuracy-latency trade-off. For instance, with SmolLM-360M on HellaSwag, our method reduces decoding latency (ms/token) by 17.75% relative to static KV quantization, improves accuracy by 7.60 points, and remains within 0.30 points of FP16 inference.
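The decoding-time loop the abstract describes, extract cheap per-token features, pick a precision, then compress that token's KV entry, can be sketched as follows. This is a minimal illustration: the feature weights, decision thresholds, and symmetric uniform quantizer are assumptions chosen for clarity, whereas the paper's controller is a learned, data-driven model, not hand-set rules.

```python
# Illustrative sketch of a token-importance-aware KV precision controller.
# All weights and thresholds below are hypothetical, for exposition only.

def importance_score(freq, attn_var, entropy):
    """Combine normalized [0, 1] token features into one importance score.
    Rare, high-variance, high-uncertainty tokens are treated as more important."""
    return 0.4 * (1.0 - freq) + 0.3 * attn_var + 0.3 * entropy

def select_bits(score):
    """Map importance to a KV precision from {2, 4, 8, 16 (FP16)}."""
    if score < 0.25:
        return 2
    if score < 0.50:
        return 4
    if score < 0.75:
        return 8
    return 16  # keep full FP16 precision for the most important tokens

def quantize_dequantize(vec, bits):
    """Symmetric uniform quantization of one KV vector at the chosen width."""
    if bits == 16:
        return list(vec)  # FP16 path: stored uncompressed
    max_abs = max(abs(x) for x in vec)
    if max_abs == 0.0:
        return list(vec)
    scale = max_abs / (2 ** (bits - 1) - 1)
    return [round(x / scale) * scale for x in vec]

def controller_step(freq, attn_var, entropy, kv_vec):
    """One decoding step: score the token, pick precision, compress its KV."""
    bits = select_bits(importance_score(freq, attn_var, entropy))
    return bits, quantize_dequantize(kv_vec, bits)
```

In this sketch, a frequent, low-variance token yields a low score and is stored at 2 bits, while a rare, high-uncertainty token keeps FP16, mirroring Huffman coding's idea of spending fewer bits on less informative symbols.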