AI Summary
This work addresses the computational intensity, low energy efficiency, and insufficient on-chip data reuse of large language model (LLM) inference by proposing FusionCIM, an operator-fusion-driven compute-in-memory (CIM) architecture. It co-designs the attention mechanism by fusing the QKᵀ and PV computations, mapping QKᵀ onto inner-product-based CIM (IP-CIM) and PV aggregation onto outer-product-based CIM (OP-CIM), and introduces a QO-stationary dataflow to maximize on-chip data reuse. A pattern-aware online-softmax mechanism reduces the exponential rescaling overhead of the nonlinear operations. Evaluated on the LLaMA-3 model, the architecture achieves up to 1.98× speedup and 3.86× energy savings over prior state-of-the-art CIM-based designs, with a system-level energy efficiency of 29.4 TOPS/W.
Abstract
In this paper, we propose FusionCIM, an operator-fusion-driven compute-in-memory (CIM) accelerator architecture for efficient and scalable LLM inference, with three key innovations: (1) a hybrid CIM pipeline architecture that maps the QKᵀ computation onto inner-product-based CIM (IP-CIM) and PV aggregation onto outer-product-based CIM (OP-CIM) for efficient matrix-multiplication fusion; (2) a QO-stationary dataflow that eliminates repeated KV loading in CIM and K-matrix access in the buffer under transpose fusion, significantly improving on-chip data reuse; and (3) a pattern-aware online-softmax mechanism that exploits distribution regularities of attention scores to reduce exponential rescaling overhead for non-linear fusion. Experimental results on the LLaMA-3 model show that FusionCIM achieves up to 3.86× energy savings and 1.98× speedup compared with prior SOTA CIM-based designs, with 29.4 TOPS/W energy efficiency at the system level.
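To make the fused attention idea concrete, the following is a minimal software sketch of standard online softmax for a single query row. It is not FusionCIM's hardware pipeline or its pattern-aware mechanism, which the abstract does not describe in detail. Each score is an inner product of the query with one key, the output is accumulated as a weighted sum of value rows, and earlier partial sums are rescaled whenever a new running maximum appears. The function names, the rescale counter, and the omission of the usual 1/sqrt(d) scaling are assumptions made for illustration only.

```python
import math


def fused_attention_row(q, keys, values):
    """Attention output for one query using online softmax (illustrative sketch).

    Returns (output_vector, rescale_count). The counter is a hypothetical
    instrumentation aid, not something taken from the paper.
    """
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    rescales = 0

    for k, v in zip(keys, values):
        # QK^T step: one score is the inner product of the query and a key.
        score = sum(qi * ki for qi, ki in zip(q, k))

        if score > running_max:
            if running_max != -math.inf:
                # A larger maximum appeared: rescale the partial sums so far.
                scale = math.exp(running_max - score)
                denom *= scale
                acc = [a * scale for a in acc]
                rescales += 1
            running_max = score

        # PV step: add this key's weighted value row to the accumulator.
        p = math.exp(score - running_max)
        denom += p
        acc = [a + p * vi for a, vi in zip(acc, v)]

    return [a / denom for a in acc], rescales


def reference_attention_row(q, keys, values):
    """Two-pass softmax attention, used only to check the fused version."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


q = [1.0, 0.0]
keys = [[1.0, 0.0], [3.0, 0.0], [2.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused_out, rescale_count = fused_attention_row(q, keys, values)
reference_out = reference_attention_row(q, keys, values)
```

In this sketch the number of rescaling steps depends on the order in which large scores arrive: keys visited in descending score order trigger none. That dependence on the score distribution is the kind of overhead the abstract says the pattern-aware mechanism targets, though how FusionCIM exploits it is not specified here.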