Towards Closing the Performance Gap for Cryptographic Kernels Between CPUs and Specialized Hardware

📅 2025-09-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

General-purpose CPUs lag far behind application-specific integrated circuits (ASICs) on cryptographic kernels built on large-integer arithmetic, particularly the number theoretic transform (NTT) and basic linear algebra subprograms (BLAS) operations. Method: The work first develops optimized per-core scalar implementations of these kernels for x86 CPUs, then vectorizes them with AVX2 and AVX-512. It also proposes MQX, a small AVX-512 multi-word extension that adds only three new instructions and requires minimal hardware modifications. Contribution/Results: The SIMD implementations achieve average speedups of 38× for NTTs and 62× for BLAS operations over state-of-the-art CPU baselines. With MQX, the slowdown relative to ASICs drops to as low as 35× on a single CPU core. A roofline analysis of MQX scaled across an entire multi-core CPU indicates that top-tier server-grade CPUs can approach the performance of state-of-the-art ASICs for these workloads.

📝 Abstract

Specialized hardware like application-specific integrated circuits (ASICs) remains the primary accelerator type for cryptographic kernels based on large integer arithmetic. Prior work has shown that commodity and server-class GPUs can achieve near-ASIC performance for these workloads. However, achieving comparable performance on CPUs remains an open challenge. This work investigates the following question: How can we narrow the performance gap between CPUs and specialized hardware for key cryptographic kernels like basic linear algebra subprograms (BLAS) operations and the number theoretic transform (NTT)? To this end, we develop an optimized scalar implementation of these kernels for x86 CPUs at the per-core level. We utilize SIMD instructions (specifically AVX2 and AVX-512) to further improve performance, achieving an average speedup of 38 times and 62 times over state-of-the-art CPU baselines for NTTs and BLAS operations, respectively. To narrow the gap further, we propose a small AVX-512 extension, dubbed multi-word extension (MQX), which delivers substantial speedup with only three new instructions and minimal proposed hardware modifications. MQX cuts the slowdown relative to ASICs to as low as 35 times on a single CPU core. Finally, we perform a roofline analysis to evaluate the peak performance achievable with MQX when scaled across an entire multi-core CPU. Our results show that, with MQX, top-tier server-grade CPUs can approach the performance of state-of-the-art ASICs for cryptographic workloads.
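
For readers unfamiliar with the NTT kernel named in the abstract, the following is a minimal sketch of a radix-2 transform and the cyclic convolution it enables. It is not the paper's implementation: the modulus 998244353, the primitive root 3, and the function names `ntt` and `cyclic_convolution` are illustrative assumptions, and cryptographic workloads typically use much wider moduli, which is where multi-word arithmetic comes in.

```python
# Illustrative parameters (assumptions, not taken from the paper).
MOD = 998244353  # prime equal to 119 * 2**23 + 1
ROOT = 3         # a primitive root modulo MOD


def ntt(values, invert=False):
    """Return the radix-2 NTT (or its inverse) of a power-of-two-length list."""
    a = list(values)
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"

    # Reorder the input by bit-reversed index.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]

    # Butterfly passes over blocks of size 2, 4, ..., n.
    length = 2
    while length <= n:
        w_len = pow(ROOT, (MOD - 1) // length, MOD)  # root of unity of order `length`
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)         # modular inverse via Fermat
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):
                u = a[k]
                v = a[k + half] * w % MOD
                a[k] = (u + v) % MOD
                a[k + half] = (u - v) % MOD
                w = w * w_len % MOD
        length <<= 1

    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        a = [x * n_inv % MOD for x in a]
    return a


def cyclic_convolution(x, y):
    """Multiply pointwise in the NTT domain, then transform back."""
    fx, fy = ntt(x), ntt(y)
    return ntt([p * q % MOD for p, q in zip(fx, fy)], invert=True)
```

Each butterfly performs one modular multiplication and two modular additions, so the cost of modular arithmetic on wide integers largely determines the throughput of the whole transform.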

Problem

Research questions and friction points this paper is trying to address.

Narrowing performance gap between CPUs and ASICs for cryptography

Optimizing cryptographic kernels like NTT and BLAS on CPUs

Developing CPU extensions to achieve near-ASIC performance levels

Innovation

Methods, ideas, or system contributions that make the work stand out.

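Optimized per-core scalar implementations of NTT and BLAS kernels for x86 CPUs

AVX2 and AVX-512 SIMD implementations averaging 38× (NTT) and 62× (BLAS) speedup over state-of-the-art CPU baselines

MQX, a proposed AVX-512 multi-word extension with only three new instructions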
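
To illustrate why multi-word arithmetic is costly without dedicated support, here is a rough sketch of 128-bit addition and 64×64-bit widening multiplication expressed in 64-bit words. The summary does not describe the semantics of the three MQX instructions, so this shows only the general carry-propagation pattern; the helper names `add_with_carry`, `add_128`, and `mul_wide` are hypothetical.

```python
MASK64 = (1 << 64) - 1  # keeps the low 64 bits, emulating one machine word


def add_with_carry(a, b, carry_in=0):
    """Add two 64-bit words plus a carry bit; return (sum word, carry out)."""
    total = a + b + carry_in
    return total & MASK64, total >> 64


def add_128(a_lo, a_hi, b_lo, b_hi):
    """Add two 128-bit values stored as (low, high) 64-bit words."""
    lo, carry = add_with_carry(a_lo, b_lo)
    hi, carry = add_with_carry(a_hi, b_hi, carry)  # carry ripples into the high word
    return lo, hi, carry


def mul_wide(a, b):
    """Multiply two 64-bit words; return the 128-bit product as (low, high)."""
    product = a * b
    return product & MASK64, product >> 64
```

Every carry and high-half result here is a separate value that must be produced and consumed explicitly, which is the kind of overhead a multi-word SIMD extension would aim to reduce.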

🔎 Similar Papers

Cheddar: A Swift Fully Homomorphic Encryption Library for CUDA GPUs