🤖 AI Summary
This work addresses the inefficiency of conventional compute-in-memory (CIM) accelerators for Transformer self-attention, where dynamic operands necessitate frequent reprogramming of non-volatile memory (NVM), degrading throughput and device endurance. To overcome this, the authors propose TrilinearCIM, a novel architecture based on dual-gate ferroelectric field-effect transistors (FeFETs) that leverages back-gate modulation to realize a three-operand multiply-accumulate primitive directly within non-volatile memory. This enables, for the first time, full execution of the Transformer attention mechanism without runtime reprogramming. Evaluations on BERT-base and ViT-base demonstrate that TrilinearCIM reduces energy consumption by up to 46.6% and latency by 20.4% compared to conventional FeFET CIM, achieves higher accuracy on seven of nine GLUE tasks, and incurs an area overhead of 37.3%.
📝 Abstract
Self-attention in Transformers generates dynamic operands that force conventional Compute-in-Memory (CIM) accelerators into costly non-volatile memory (NVM) reprogramming cycles, degrading throughput and stressing device endurance. Existing solutions either reduce, but do not eliminate, NVM writes through matrix decomposition or sparsity, or move attention computation to digital CMOS at the expense of NVM density. We present TrilinearCIM, a Double-Gate FeFET (DG-FeFET)-based architecture that uses back-gate modulation to realize a three-operand multiply-accumulate primitive for in-memory attention computation without dynamic ferroelectric reprogramming. Evaluated on BERT-base (GLUE) and ViT-base (ImageNet and CIFAR), TrilinearCIM outperforms conventional CIM on seven of nine GLUE tasks while achieving up to 46.6% energy reduction and 20.4% latency improvement over conventional FeFET CIM at 37.3% area overhead. To our knowledge, this is the first architecture to perform complete Transformer attention computation exclusively in NVM cores without runtime reprogramming.
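To illustrate why a three-operand primitive removes the need to write dynamic operands into NVM, the sketch below models the DG-FeFET cell abstractly as computing a product of one stored weight and two dynamic inputs, and shows one plausible mapping (an assumption for illustration, not necessarily the paper's exact dataflow): fusing the key projection with the Q·Kᵀ product so that only the static weight matrix `Wk` resides in NVM, while Q and the token activations X are applied as dynamic gate inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 4                       # model dimension, sequence length
X  = rng.standard_normal((n, d))  # token activations (dynamic, per input)
Wq = rng.standard_normal((d, d))  # projection weights: programmed into NVM once
Wk = rng.standard_normal((d, d))

# Q can come from an ordinary two-operand CIM matmul (static Wq in NVM).
Q = X @ Wq

# Trilinear form: S[m, p] = sum_{j, k} Q[m, j] * Wk[k, j] * X[p, k].
# Each cell multiplies a stored weight (Wk) by two dynamic inputs (Q, X),
# so the attention scores never require writing K into the array.
S = np.einsum('mj,kj,pk->mp', Q, Wk, X)

# Reference: the conventional two-matmul path, which would need K in memory.
K = X @ Wk
assert np.allclose(S, Q @ K.T)
```

The identity holds because S = Q·(X·Wk)ᵀ = Σⱼₖ Q[m,j]·Wk[k,j]·X[p,k]; the reprogramming-free property comes purely from Wk being the only operand that must live in the array.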