🤖 AI Summary
This work addresses the computational bottleneck of multi-scalar multiplication (MSM) in zero-knowledge proofs (ZKPs). To accelerate MSM on an elliptic curve over a 377-bit modulus, a workload characterized by high-precision arithmetic and strong data dependencies, we propose a hardware acceleration framework tailored to the Xilinx Versal ACAP platform. We introduce an AI Engine (AIE)-optimized point addition (PADD) design featuring a carry-save-like carry-propagation algorithm aligned with VLIW/SIMD instruction-level parallelism, and systematically evaluate four spatial mapping strategies to improve intra- and inter-task parallelism on chip. Leveraging custom assembly programming, optimized arbitrary-precision arithmetic, and fine-grained memory access scheduling, our implementation utilizes 50.2% of the theoretical memory bandwidth and delivers a 568× speedup over the integrated CPU on the AIE evaluation board.
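The carry-save-like idea mentioned above can be illustrated with a minimal Python sketch. It is not the authors' AIE kernel: the 16-bit limb width, the 24-limb operand size, and all function names are assumptions chosen only for illustration. The point it shows is that the multiply-accumulate steps can run without any carry dependency between columns, with carries resolved in a single pass afterwards.

```python
# Illustrative sketch only: limb width and helper names are assumptions,
# not taken from the paper.
LIMB_BITS = 16
MASK = (1 << LIMB_BITS) - 1


def to_limbs(x, n):
    """Split a non-negative integer into n little-endian limbs."""
    return [(x >> (LIMB_BITS * i)) & MASK for i in range(n)]


def from_limbs(limbs):
    """Recombine limbs; also valid for redundant (oversized) limbs."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def mul_deferred_carry(a, b):
    """Schoolbook product whose column sums are accumulated without carries.

    Every update is an independent multiply-accumulate, which is the kind
    of operation a SIMD/VLIW unit can issue in parallel.
    """
    cols = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            cols[i + j] += ai * bj
    return cols


def propagate_carries(cols):
    """Resolve all deferred carries in one sequential pass."""
    out, carry = [], 0
    for c in cols:
        total = c + carry
        out.append(total & MASK)
        carry = total >> LIMB_BITS
    return out, carry


x = (1 << 377) - 1      # a 377-bit operand
y = 3 ** 200            # a 317-bit operand
cols = mul_deferred_carry(to_limbs(x, 24), to_limbs(y, 24))
limbs, final_carry = propagate_carries(cols)
print(from_limbs(limbs) == x * y)
```

The column sums in `cols` are a redundant representation of the product: they already encode the correct value, but individual entries exceed the limb width until `propagate_carries` normalizes them.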
📝 Abstract
Multi-scalar multiplication (MSM) is crucial in cryptographic applications and computationally intensive in zero-knowledge proofs. MSM accumulates the products of scalars and points on an elliptic curve over a 377-bit modulus, and the Pippenger algorithm converts MSM into a series of highly parallel elliptic curve point additions (PADDs). This study investigates accelerating MSM on the Versal ACAP platform, an emerging hardware platform with a spatial architecture that integrates 400 AI Engines (AIEs) with programmable logic and a processing system. AIEs are SIMD-based VLIW processors capable of performing vector multiply-accumulate operations, making them well suited to the multiplication-heavy workloads in PADD. Unlike the simpler multiplication tasks in previous studies, cryptographic computations also require complex operations such as carry propagation. These operations necessitate architecture-aware optimizations, including a dedicated intra-core coding style to fully exploit VLIW capabilities and an inter-core strategy for spatial task mapping. We propose several optimizations to accelerate PADDs, including (1) algorithmic optimizations for carry propagation that employ a carry-save-like technique to exploit VLIW and SIMD capabilities and (2) a comparison of four distinct spatial mappings to enhance intra- and inter-task parallelism. Our approach utilizes 50.2% of the theoretical memory bandwidth and provides a 568× speedup over the integrated CPU on the AIE evaluation board.
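To make the reduction from MSM to point additions concrete, here is a minimal Python sketch of the bucket method commonly associated with Pippenger's algorithm. It is an illustration under stated assumptions, not the paper's implementation: integers modulo a prime under addition stand in for elliptic curve points, so `add` plays the role of a PADD, and the window size, scalar width, and all names are hypothetical.

```python
# Illustrative sketch only: the group, window size, and names are
# assumptions; a real MSM would use elliptic curve point addition.
def msm_bucket(scalars, points, add, zero, window_bits=4, scalar_bits=16):
    """Compute sum(k_i * P_i) using only the group operation `add`."""
    num_windows = (scalar_bits + window_bits - 1) // window_bits
    digit_mask = (1 << window_bits) - 1
    result = zero
    for w in reversed(range(num_windows)):
        # Shift the partial result left by one window (repeated doubling).
        for _ in range(window_bits):
            result = add(result, result)
        # Sort points into buckets by the current scalar digit.
        buckets = [zero] * (1 << window_bits)
        for k, p in zip(scalars, points):
            digit = (k >> (w * window_bits)) & digit_mask
            if digit:
                buckets[digit] = add(buckets[digit], p)
        # Running-sum trick: yields sum(j * buckets[j]) with additions only.
        running, window_sum = zero, zero
        for j in range(len(buckets) - 1, 0, -1):
            running = add(running, buckets[j])
            window_sum = add(window_sum, running)
        result = add(result, window_sum)
    return result


Q = (1 << 61) - 1                      # stand-in group: integers mod Q


def toy_add(a, b):
    return (a + b) % Q


scalars = [5, 65535, 1234, 0]
points = [7, 11, 123456789, 42]
expected = sum(k * p for k, p in zip(scalars, points)) % Q
print(msm_bucket(scalars, points, toy_add, 0) == expected)
```

The bucket accumulations within one window are independent of each other, which is the source of the parallelism across PADDs that the abstract refers to.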