Reimagining Memory Access for LLM Inference: Compression-Aware Memory Controller Design

📅 2025-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the memory bandwidth and capacity bottlenecks that limit inference efficiency in large language models (LLMs), this work proposes an LLM-aware on-chip memory controller architecture. The design integrates hardware-accelerated lossless block compression (LZ4/ZSTD) with fine-grained bit-level accessibility control in the memory controller, coordinated with a context-adaptive dynamic quantization engine for weights and KV caches. This enables joint compression–quantization optimization that preserves inference accuracy while scaling memory bandwidth and energy consumption with input context length. Experimental results show a 25.2% reduction in weight storage and a 46.9% reduction in KV cache size. Implemented in a 7 nm process, the prototype achieves 8 TB/s throughput at 4 GHz with 32 lanes, with an area overhead under 3.8 mm².

📝 Abstract
The efficiency of Large Language Model (LLM) inference is often constrained by substantial memory bandwidth and capacity demands. Existing techniques, such as pruning, quantization, and mixture of experts/depth, reduce memory capacity and/or bandwidth consumption at the cost of slight degradation in inference quality. This paper introduces a design solution that further alleviates memory bottlenecks by enhancing the on-chip memory controller in AI accelerators to achieve two main objectives: (1) significantly reducing memory capacity and bandwidth usage through lossless block compression (e.g., LZ4 and ZSTD) of model weights and key-value (KV) cache without compromising inference quality, and (2) enabling memory bandwidth and energy consumption to scale proportionally with context-dependent dynamic quantization. These goals are accomplished by equipping the on-chip memory controller with mechanisms to improve fine-grained bit-level accessibility and compressibility of weights and KV cache through LLM-aware configuration of in-memory placement and representation. Experimental results on publicly available LLMs demonstrate the effectiveness of this approach, showing memory footprint reductions of 25.2% for model weights and 46.9% for KV cache. In addition, our hardware prototype at 4 GHz and 32 lanes (7 nm) achieves 8 TB/s throughput with a modest area overhead (under 3.8 mm²), which underscores the viability of LLM-aware memory control as a key to efficient large-scale inference.
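The abstract's central premise — that quantized weight and KV-cache blocks remain losslessly compressible — can be illustrated in plain software. The sketch below is a stand-in under stated assumptions: Python's zlib substitutes for the paper's hardware LZ4/ZSTD engines, and a synthetic zero-skewed nibble distribution substitutes for real 4-bit quantized weights; the block size and distribution are illustrative, not taken from the paper.

```python
import random
import zlib

random.seed(0)

# Synthetic 4-bit quantized weight block: real quantized LLM weights
# cluster near zero, which is what makes lossless block compression
# pay off. (The skewed distribution below is an illustrative assumption.)
BLOCK = 4096  # bytes per compression block, i.e. 8192 nibbles
skew = [50, 20, 10, 5, 3, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1]
nibbles = random.choices(range(16), weights=skew, k=2 * BLOCK)

# Pack two 4-bit values per byte, as a memory controller would store them.
packed = bytes((nibbles[2 * i] << 4) | nibbles[2 * i + 1]
               for i in range(BLOCK))

# Compress the block independently, so the controller can still
# address blocks randomly; zlib stands in for hardware LZ4/ZSTD.
comp = zlib.compress(packed, level=6)

# Lossless: decompression must reproduce the block bit-for-bit.
assert zlib.decompress(comp) == packed
print(f"block: {len(packed)} B -> {len(comp)} B "
      f"({100 * (1 - len(comp) / len(packed)):.1f}% smaller)")
```

Because each block is compressed on its own, a controller only needs the per-block compressed length (kept in metadata) to locate and decompress any block on demand — the same property that lets the paper's design keep fine-grained accessibility while shrinking the footprint.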
Problem

Research questions and friction points this paper is trying to address.

Reducing memory bandwidth and capacity demands for LLM inference
Enhancing on-chip memory controller for lossless weight and KV cache compression
Scaling memory bandwidth and energy with dynamic quantization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lossless block compression for weights and KV cache
Dynamic quantization scaling with context
LLM-aware in-memory placement and representation
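The "dynamic quantization scaling with context" idea above can be sketched as a simple bandwidth model: choose KV-cache precision from the current context length, so that the bytes moved per decode step (and hence bandwidth and energy) grow sublinearly with context. Every threshold, bit-width, and model dimension below is a hypothetical placeholder, not a value from the paper.

```python
def kv_bits(context_len: int) -> int:
    """Pick a KV-cache bit-width from context length.

    Hypothetical policy: short contexts keep full precision, long
    contexts trade precision for bandwidth. Thresholds are illustrative.
    """
    if context_len <= 1024:
        return 16
    if context_len <= 8192:
        return 8
    return 4


def kv_bytes_per_step(context_len: int, layers: int = 32,
                      heads: int = 32, head_dim: int = 128) -> int:
    """KV-cache bytes read per decode step (keys + values).

    Model dimensions default to placeholder values for a mid-sized LLM.
    """
    bits = kv_bits(context_len)
    return 2 * layers * heads * head_dim * context_len * bits // 8


# At 16k tokens the 4-bit setting moves a quarter of the bytes that a
# fixed 16-bit KV cache would, which is where the proportional
# bandwidth/energy scaling comes from.
print(kv_bytes_per_step(512), kv_bytes_per_step(16384))
```

The controller-side coordination the paper describes would then place each KV block in memory at the chosen precision, so that the lossless compressor (previous point) operates on the already-narrowed representation.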
Rui Xie
Rensselaer Polytechnic Institute, Troy, NY, USA
Asad Ul Haq
Graduate Student, RPI
Computer Systems Engineering
Linsen Ma
Rensselaer Polytechnic Institute
Yunhua Fang
Graduate Student, Rensselaer Polytechnic Institute
LLM inference, memory architecture
Zirak Burzin Engineer
Wiseburn Da Vinci Science, El Segundo, CA, USA
Liu Liu
Rensselaer Polytechnic Institute, Troy, NY, USA
Tong Zhang
Rensselaer Polytechnic Institute, Troy, NY, USA