🤖 AI Summary
Current digital Compute-in-Memory (CIM) architectures suffer from high latency and low reliability in multiply-accumulate operations—particularly in bit-serial designs, where computational fidelity is significantly inferior to memory access fidelity. To address this fundamental bottleneck, this work proposes a technology-agnostic, high-radix parallel counting paradigm that maps arithmetic operations directly onto memory-primitive bit-level logic operations. Leveraging commodity DRAM, we implement a fault-tolerant in-memory multiplication architecture supporting Hamming and BCH error-correcting codes. Compared to state-of-the-art DRAM-CIM approaches, our design achieves up to 10× speedup, 8× higher energy efficiency (GOPS/W), and 9.5× higher area efficiency, and it surpasses GPU performance in vector-matrix multiplication. The framework decouples computation from memory technology constraints, enabling scalable, reliable, and high-throughput CIM without custom hardware.
📝 Abstract
Computing-in-memory (CIM) has been demonstrated across various memory technologies, ranging from memristive crossbars performing analog dot-product computations to large-scale digital bitwise operations in commodity DRAM and other proposed non-volatile memory technologies. However, current CIM solutions face latency and reliability challenges: CIM fidelity lags considerably behind access fidelity. Furthermore, bulk-bitwise CIM, although highly parallelized, requires long latency for operations like multiplication and addition due to their bit-serial computation. This paper presents Count2Multiply, a technology-agnostic digital CIM approach to perform multiplication, addition, and other operations using high-radix, massively parallel counting enabled by CIM bulk-bitwise logic operations. Designed to meet fault tolerance requirements, Count2Multiply integrates traditional row-wise error correction codes, such as Hamming and BCH, to address the high error rates in existing CIM designs. We demonstrate Count2Multiply with a detailed application to CIM in conventional DRAM due to its ubiquity and high endurance. However, we note that the Count2Multiply architecture is compatible with other functionally complete CIM proposals. Compared to the state-of-the-art in-DRAM CIM method, Count2Multiply achieves up to 10x speedup, 8x higher GOPS/Watt, and 9.5x higher GOPS/area, while outperforming GPUs for vector-matrix multiplications.
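To make the counting-based multiplication idea concrete, the following is a minimal software sketch of multiplying via column-wise counting of partial-product bits. It is a simplified radix-2 illustration under assumed semantics: partial products are formed with bitwise ANDs (the kind of bulk-bitwise primitive CIM provides), set bits in each output column are counted, and counts are resolved into result bits plus carries. The high-radix parallel counters and the mapping onto DRAM rows are the paper's contribution and are not modeled here; `count_multiply` is a hypothetical helper name.

```python
def count_multiply(a: int, b: int, width: int = 8) -> int:
    """Multiply two unsigned integers by counting partial-product bits per column."""
    # Step 1: form partial products with ANDs (a bulk-bitwise CIM primitive)
    # and tally the set bits falling into each output bit position (column).
    cols = [0] * (2 * width)
    for i in range(width):
        if (b >> i) & 1:
            for j in range(width):
                if (a >> j) & 1:
                    cols[i + j] += 1
    # Step 2: resolve each column count into a result bit and a carry:
    # a count c contributes bit (c & 1) here and carry (c >> 1) to the next column.
    result, carry = 0, 0
    for pos in range(2 * width):
        total = cols[pos] + carry
        result |= (total & 1) << pos
        carry = total >> 1
    return result
```

In hardware, step 1 becomes massively parallel row-wide AND operations, and step 2 is where high-radix counting pays off: counting many column bits at once replaces long bit-serial carry chains, which is the source of the latency reduction the abstract claims.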