🤖 AI Summary
Existing computing-in-memory (CiM) architectures struggle to efficiently support large-integer modular multiplication—critical for cryptographic primitives such as RSA and ECC—due to two key limitations: (1) an inherent bias toward low-bitwidth operations, which hinders scaling to high-precision arithmetic; and (2) reliance on inefficient in-memory logic, which incurs high latency and substantial area overhead for wide operands. This work presents LaMoS, a mapping of Barrett's modular reduction algorithm onto an SRAM-based CiM architecture. The authors propose a workload-partitioning-driven, customized dataflow and computational optimization strategy to improve scalability and energy efficiency at high bitwidths. Experimental results show a 7.02× speedup over state-of-the-art SRAM-based CiM designs, along with lower scaling cost at high bitwidths, establishing a scalable, hardware-efficient paradigm for accelerating cryptographic workloads in memory.
📝 Abstract
Barrett's algorithm is one of the most widely used methods for performing modular multiplication, a critical nonlinear operation in modern privacy-preserving computing techniques such as homomorphic encryption (HE) and zero-knowledge proofs (ZKP). Since modular multiplication dominates the processing time in these applications, computational complexity and memory limitations significantly impact performance. Computing-in-Memory (CiM) is a promising approach to tackle this problem. However, existing schemes suffer from two main problems: 1) most works focus on low bit-width modular multiplication, which is inadequate for mainstream cryptographic algorithms such as elliptic curve cryptography (ECC) and RSA, both of which require high bit-width operations; 2) recent efforts targeting large-number modular multiplication rely on inefficient in-memory logic operations, resulting in high scaling costs at larger bit-widths and increased latency. To address these issues, we propose LaMoS, an efficient SRAM-based CiM design for large-number modular multiplication, offering high scalability and area efficiency. First, we analyze Barrett's modular multiplication method and map the workload onto SRAM CiM macros for high bit-width cases. Additionally, we develop an efficient CiM architecture and dataflow to optimize large-number modular multiplication. Finally, we refine the mapping scheme for better scalability in high bit-width scenarios using workload grouping. Experimental results show that LaMoS achieves a $7.02\times$ speedup and reduces high bit-width scaling costs compared to existing SRAM-based CiM designs.
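For readers unfamiliar with the core operation: Barrett reduction computes $x \bmod m$ without a hardware division, by estimating the quotient with a single multiplication against a precomputed constant $\mu = \lfloor 2^{2k}/m \rfloor$. The sketch below is a minimal arbitrary-precision illustration of that textbook algorithm, not the paper's SRAM mapping or dataflow; all function names are illustrative.

```python
def barrett_precompute(m):
    """Precompute (k, mu) for modulus m: k = bit length, mu = floor(2^(2k) / m)."""
    k = m.bit_length()
    mu = (1 << (2 * k)) // m
    return k, mu

def barrett_reduce(x, m, k, mu):
    """Reduce 0 <= x < 2^(2k) modulo m (x is e.g. a product of two k-bit residues)."""
    q = (x * mu) >> (2 * k)   # quotient estimate: underestimates floor(x/m) by at most 1
    r = x - q * m             # so 0 <= r < 2m
    while r >= m:             # correction needs at most one subtraction here
        r -= m
    return r

def mod_mul(a, b, m, k, mu):
    """Modular multiplication of two residues via Barrett reduction."""
    return barrett_reduce(a * b, m, k, mu)
```

For a high bit-width example in the spirit of the ECC use case, one can take the 255-bit Curve25519 prime $p = 2^{255} - 19$, precompute once with `barrett_precompute(p)`, and then every `mod_mul` costs two wide multiplications and a shift instead of a division, which is what makes the algorithm attractive for mapping onto multiply-oriented CiM macros.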