🤖 AI Summary
In high-security cryptographic applications such as homomorphic encryption and zero-knowledge proofs, large-bit-width modular reduction on FPGAs incurs substantial area overhead because it relies on large lookup tables (LUTs). To address this, we propose ALLMod, an area-efficient hybrid LUT-based design. Inspired by the iterative method, ALLMod splits the input bit groups into two distinct workloads and later fuses their results, lowering area cost without compromising throughput. A configurable template facilitates the workload split and keeps the distribution balanced, and design space exploration identifies the best point at which to fuse the workload results under given constraints. Experimental evaluation shows up to 1.65× and 3× improvements in area efficiency over conventional LUT-based methods for bit-widths of 128 and 8,192, respectively.
📝 Abstract
Modular arithmetic, particularly modular reduction, is widely used in cryptographic applications such as homomorphic encryption (HE) and zero-knowledge proofs (ZKP). High-bit-width operations are crucial for enhancing security; however, they are computationally intensive due to the large number of modular operations required. The lookup-table-based (LUT-based) approach, a ``space-for-time'' technique, reduces computational load by segmenting the input number into smaller bit groups, pre-computing modular reduction results for each segment, and storing these results in LUTs. While effective, this method incurs significant hardware overhead due to extensive LUT usage. In this paper, we introduce ALLMod, a novel approach that improves the area efficiency of LUT-based large-number modular reduction by employing hybrid workloads. Inspired by the iterative method, ALLMod splits the bit groups into two distinct workloads, achieving lower area costs without compromising throughput. We first develop a template to facilitate workload splitting and ensure balanced distribution. Then, we conduct design space exploration to evaluate the optimal timing for fusing workload results, enabling us to identify the most efficient design under specific constraints. Extensive evaluations show that ALLMod achieves up to $1.65\times$ and $3\times$ improvements in area efficiency over conventional LUT-based methods for bit-widths of $128$ and $8,192$, respectively.
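The conventional LUT-based reduction described above can be illustrated with a minimal software sketch. This is not ALLMod's hybrid design, only the baseline idea of segmenting the input, pre-computing per-segment residues, and summing them. The function names `build_luts` and `lut_mod_reduce`, the 4-bit group width, and the example modulus are assumptions chosen for illustration; the sketch also looks up every bit group, whereas a hardware design might pass the low-order bits through directly.

```python
def build_luts(modulus, total_bits, group_bits):
    """Pre-compute (v << shift) mod modulus for every value v of every bit group.

    Hypothetical helper: one table per group, each with 2**group_bits entries.
    """
    luts = []
    for shift in range(0, total_bits, group_bits):
        luts.append([(v << shift) % modulus for v in range(1 << group_bits)])
    return luts


def lut_mod_reduce(x, modulus, luts, group_bits):
    """Reduce x modulo `modulus` using only table lookups, additions, and subtractions."""
    mask = (1 << group_bits) - 1
    acc = 0
    for i, table in enumerate(luts):
        segment = (x >> (i * group_bits)) & mask  # extract the i-th bit group
        acc += table[segment]                     # add its pre-computed residue
    # Each looked-up residue is below the modulus, so the sum is below
    # len(luts) * modulus and a few conditional subtractions finish the job.
    while acc >= modulus:
        acc -= modulus
    return acc


# Usage with assumed example parameters: a 32-bit input and 4-bit groups.
q = 65521
luts = build_luts(q, total_bits=32, group_bits=4)
x = 0xDEADBEEF
print(lut_mod_reduce(x, q, luts, 4) == x % q)
```

In this sketch the number of stored entries is `(total_bits / group_bits) * 2**group_bits`, each as wide as the modulus, which gives a rough sense of why table storage grows quickly at large bit-widths.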