🤖 AI Summary
This work addresses the inefficiency of conventional compute-in-memory (CIM) accelerators for Transformer self-attention, where dynamic operands necessitate frequent reprogramming of non-volatile memory (NVM), degrading throughput and device endurance. To overcome this, the authors propose TrilinearCIM, a novel architecture based on dual-gate ferroelectric field-effect transistors (FeFETs) that leverages back-gate modulation to realize a three-operand multiply-accumulate primitive directly within non-volatile memory. This enables, for the first time, full execution of the Transformer attention mechanism without runtime reprogramming. Evaluations on BERT-base and ViT-base demonstrate that TrilinearCIM reduces energy consumption by up to 46.6% and latency by 20.4% compared to conventional FeFET CIM, achieves higher accuracy on seven of nine GLUE tasks, and incurs an area overhead of 37.3%.
📝 Abstract
Self-attention in Transformers generates dynamic operands that force conventional Compute-in-Memory (CIM) accelerators into costly non-volatile memory (NVM) reprogramming cycles, degrading throughput and stressing device endurance. Existing solutions either reduce, but do not eliminate, NVM writes through matrix decomposition or sparsity, or move attention computation to digital CMOS at the expense of NVM density. We present TrilinearCIM, a Double-Gate FeFET (DG-FeFET)-based architecture that uses back-gate modulation to realize a three-operand multiply-accumulate primitive for in-memory attention computation without dynamic ferroelectric reprogramming. Evaluated on BERT-base (GLUE) and ViT-base (ImageNet and CIFAR), TrilinearCIM outperforms conventional CIM on seven of nine GLUE tasks while achieving up to 46.6% energy reduction and 20.4% latency improvement over conventional FeFET CIM at 37.3% area overhead. To our knowledge, this is the first architecture to perform complete Transformer attention computation exclusively in NVM cores without runtime reprogramming.
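To illustrate why a three-operand primitive removes the need to write dynamic operands into NVM, the sketch below models the DG-FeFET cell abstractly as computing a product of one stored weight and two dynamic inputs, and shows one plausible mapping (an assumption for illustration, not necessarily the paper's exact dataflow): fusing the key projection with the Q·Kᵀ product so that only the static weight matrix `Wk` resides in NVM, while Q and the token activations X are applied as dynamic gate inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 4                       # model dimension, sequence length
X  = rng.standard_normal((n, d))  # token activations (dynamic, per input)
Wq = rng.standard_normal((d, d))  # projection weights: programmed into NVM once
Wk = rng.standard_normal((d, d))

# Q can come from an ordinary two-operand CIM matmul (static Wq in NVM).
Q = X @ Wq

# Trilinear form: S[m, p] = sum_{j, k} Q[m, j] * Wk[k, j] * X[p, k].
# Each cell multiplies a stored weight (Wk) by two dynamic inputs (Q, X),
# so the attention scores never require writing K into the array.
S = np.einsum('mj,kj,pk->mp', Q, Wk, X)

# Reference: the conventional two-matmul path, which would need K in memory.
K = X @ Wk
assert np.allclose(S, Q @ K.T)
```

The identity holds because S = Q·(X·Wk)ᵀ = Σⱼₖ Q[m,j]·Wk[k,j]·X[p,k]; the reprogramming-free property comes purely from Wk being the only operand that must live in the array.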