🤖 AI Summary
To address the high overhead of exponent alignment and the latency of bit-serial input in floating-point compute-in-memory (FP-CIM), this work proposes a synergistic optimization framework comprising Segmented Exponent Alignment (SEA) and Dynamic Wordline Activation (DWA). By modeling the clustered distribution of input exponents, SEA constructs a segmented exponent space that eliminates the need for global maximum-exponent detection, enabling on-demand wordline activation and substantially shortening the analog-domain floating-point multiply-accumulate (FP-MAC) hardware execution path. The proposed method preserves high computational accuracy while significantly reducing energy consumption and latency: on the VGG16-CIFAR10 benchmark, it achieves 63.8% lower power consumption and 40.87% lower latency compared to conventional FP-CIM approaches. This work establishes a novel architectural paradigm for efficient, low-overhead FP-CIM systems.
📝 Abstract
With the rise of compute-in-memory (CIM) accelerators, floating-point multiply-and-accumulate (FP-MAC) operations have gained extensive attention for their higher accuracy over integer MACs in neural networks. However, the hardware overhead of exponent comparison and mantissa alignment, together with the delay introduced by bit-serial input methods, remains a hindrance to efficient FP-MAC implementation. To address this, we propose Segmented Exponent Alignment (SEA) and Dynamic Wordline Activation (DWA) strategies. SEA exploits the observation that input exponents are often clustered around zero or within a narrow range. By segmenting the exponent space and aligning mantissas within each segment, SEA eliminates the need for maximum-exponent detection and reduces input mantissa shifting, thereby lowering processing latency. DWA further reduces latency while maintaining accuracy by activating wordlines according to the exponent segments defined by SEA. Simulation results demonstrate that, compared with the conventional comparison-tree-based maximum-exponent alignment method, our approach reduces power consumption by 63.8% and delay by 40.87% on the VGG16-CIFAR10 benchmark.
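To illustrate the contrast the abstract draws, the following is a minimal Python sketch comparing global max-exponent alignment against a segmented scheme in the spirit of SEA. The segment width, the per-segment base exponent, and the final combine step are illustrative assumptions, not the paper's circuit: the point is only that each mantissa shift is bounded by the segment width and no global maximum-exponent search is needed.

```python
from collections import defaultdict

def max_exponent_align(pairs):
    """Baseline: align every mantissa to the single global maximum exponent.

    `pairs` is a list of (integer mantissa, integer exponent) tuples.
    Returns (accumulated mantissa, common exponent).
    """
    e_max = max(e for _, e in pairs)       # global max-exponent detection
    acc = 0
    for m, e in pairs:
        acc += m >> (e_max - e)            # shift can be arbitrarily large
    return acc, e_max

def segmented_align(pairs, seg_width=4):
    """SEA-style sketch (assumed scheme): bucket inputs by exponent segment,
    align each mantissa to its segment's top exponent (shift bounded by
    seg_width - 1), then combine the per-segment partial sums.
    """
    segments = defaultdict(int)
    for m, e in pairs:
        seg = e // seg_width                   # segment index
        base = (seg + 1) * seg_width - 1       # top exponent of this segment
        segments[seg] += m >> (base - e)       # bounded, local shift only
    # Combine partial sums at the highest segment's base exponent.
    top_base = (max(segments) + 1) * seg_width - 1
    acc = 0
    for seg, s in segments.items():
        base = (seg + 1) * seg_width - 1
        acc += s >> (top_base - base)
    return acc, top_base
```

When all shifts are exact the two schemes agree; when mantissa bits fall off the right edge, both truncate, but the segmented version never shifts an input by more than the segment width within a bucket, which is what shortens the alignment path.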