UniCAIM: A Unified CAM/CIM Architecture with Static-Dynamic KV Cache Pruning for Efficient Long-Context LLM Inference

📅 2025-04-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high memory and computational overhead of KV caching in long-context LLM inference, this work proposes UniCAIM, a unified hardware architecture that integrates content-addressable memory (CAM) and compute-in-memory (CIM) on ferroelectric field-effect transistors (FeFETs). A novel CAM/CIM cell supports static and dynamic KV cache pruning simultaneously through three computation modes: the CAM mode performs O(1) approximate similarity search for dynamic pruning; the charge-domain CIM mode accumulates similarity scores for flexible static pruning; and the current-domain CIM mode computes exact attention over the retained KV entries. The cell design exploits the multi-level characteristics of FeFETs for signed multibit KV storage and in-situ attention computation. At the circuit level, the architecture reduces the area-energy-delay product (AEDP) by 8.2×–831×; at the application level, it achieves accuracy comparable to dense attention, significantly improving energy efficiency for long-context inference.
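The CAM-mode search can be mirrored in software to see why it enables cheap dynamic pruning. Below is a minimal NumPy sketch, assuming the CAM scores similarity by counting sign matches between a quantized query and each cached key; the paper's actual FeFET matching function and precision are hardware-specific, and all function names here are illustrative.

```python
import numpy as np

def cam_similarity(query, keys):
    """Approximate similarity as the number of matching signs between the
    query and each cached key. A CAM evaluates all rows in parallel, which
    is what makes the search O(1) in time; NumPy vectorization stands in
    for that parallelism here. (Sign-match scoring is an assumption.)"""
    q = np.sign(query)           # (d,)   quantized query
    k = np.sign(keys)            # (n, d) quantized cached keys
    return (k == q).sum(axis=1)  # (n,)   higher score = more similar

def dynamic_prune(query, keys, keep):
    """Dynamic pruning: retain, per decoding step, the `keep` cached
    tokens whose keys best match the current query."""
    scores = cam_similarity(query, keys)
    return np.argsort(scores)[-keep:]  # indices of retained KV entries
```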

📝 Abstract
Transformer-based large language models (LLMs) have achieved impressive performance in various natural language processing (NLP) applications. However, the high memory and computation cost induced by the KV cache limits inference efficiency, especially for long input sequences. Compute-in-memory (CIM)-based accelerators have been proposed for LLM acceleration with KV cache pruning. However, as existing accelerators only support static pruning with a fixed pattern or dynamic pruning with primitive implementations, they suffer from either high accuracy degradation or low efficiency. In this paper, we propose a ferroelectric FET (FeFET)-based unified content addressable memory (CAM) and CIM architecture, dubbed UniCAIM. UniCAIM features simultaneous support for static and dynamic pruning with three computation modes: 1) in the CAM mode, UniCAIM enables approximate similarity measurement in O(1) time for dynamic KV cache pruning with high energy efficiency; 2) in the charge-domain CIM mode, static pruning can be supported based on an accumulative similarity score, which is much more flexible than fixed patterns; 3) in the current-domain mode, exact attention computation can be conducted with the selected subset of the KV cache. We further propose a novel CAM/CIM cell design that leverages the multi-level characteristics of FeFETs for signed multibit storage of the KV cache and in-place attention computation. Extensive experimental results demonstrate that UniCAIM reduces the area-energy-delay product (AEDP) by 8.2×–831× over state-of-the-art CIM-based LLM accelerators at the circuit level, with accuracy comparable to dense attention at the application level, showing its great potential for efficient long-context LLM inference.
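To make the three-mode flow concrete, here is a hedged end-to-end sketch of one decoding step. The split of work across the CAM, charge-domain, and current-domain modes follows the abstract; the specific scoring rule and eviction policy below (sign-match similarity, top-k by accumulated score) are illustrative assumptions, not the paper's verified algorithm.

```python
import numpy as np

def exact_attention(query, keys, values):
    """Exact softmax attention over the selected KV subset, corresponding
    to UniCAIM's current-domain CIM mode."""
    scores = keys @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

def decode_step(query, keys, values, acc_scores, keep_static, keep_dynamic):
    # CAM mode: approximate similarity of the query to every cached key
    # (evaluated for all rows in parallel, i.e. O(1) time, in hardware).
    approx = (np.sign(keys) == np.sign(query)).sum(axis=1)

    # Charge-domain mode: accumulate similarity across decoding steps;
    # tokens with persistently low accumulated scores are statically
    # pruned, which is more flexible than a fixed sliding-window pattern.
    acc_scores = acc_scores + approx
    static_keep = np.argsort(acc_scores)[-keep_static:]

    # Dynamic pruning: among statically retained tokens, attend only to
    # the keep_dynamic entries most similar to the current query.
    top = np.argsort(approx[static_keep])[-keep_dynamic:]
    selected = static_keep[top]

    # Current-domain mode: exact attention on the pruned KV cache.
    out = exact_attention(query, keys[selected], values[selected])
    return out, acc_scores
```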
Problem

Research questions and friction points this paper is trying to address.

How to reduce the memory and computation cost of the KV cache in LLM inference
How to support both static and dynamic KV cache pruning efficiently
How to preserve accuracy while improving efficiency in long-context LLM inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified CAM/CIM architecture for KV cache pruning
Static-dynamic pruning with three computation modes
FeFET-based multi-level cell for efficient attention computation (see the sketch below)
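The abstract states that the CAM/CIM cell uses FeFET multi-level states for signed multibit storage of the KV cache. One common way to realize signed values in analog memory is a differential pair of cells, and that assumption is what the sketch below illustrates; it is not the paper's verified circuit.

```python
def encode_signed(value, levels=4):
    """Encode a signed integer in [-(levels-1), levels-1] on a differential
    pair of multi-level cells (positive cell, negative cell).
    NOTE: this differential scheme is an illustrative assumption; the paper
    only states that FeFET multi-level states store signed multibit values."""
    assert -(levels - 1) <= value <= levels - 1
    return (value, 0) if value >= 0 else (0, -value)

def in_memory_dot(xs, weights):
    """In-memory dot product: each cell pair contributes (pos - neg) * input,
    mirroring how a current-domain CIM array sums signed column currents."""
    return sum(x * (p - n) for x, (p, n) in zip(xs, weights))

# Example: encode [3, -2] and compute a dot product with inputs [1, 2]:
# 1*3 + 2*(-2) = -1
print(in_memory_dot([1, 2], [encode_signed(3), encode_signed(-2)]))
```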
👥 Authors
Weikai Xu
Department of Communication Engineering, Xiamen University
Chaos Communications · Wireless Communications

Wenxuan Zeng
Peking University
Efficient Deep Learning · Large Language Model

Qianqian Huang
Institute of Microelectronics, Peking University
Microelectronics

Meng Li
Institute for Artificial Intelligence, Peking University, China; Beijing Advanced Innovation Center for Integrated Circuits, Beijing, China

Ru Huang
Beijing Advanced Innovation Center for Integrated Circuits, Beijing, China