🤖 AI Summary
To address the lack of efficient hardware support for SHA-3 on RISC-V, this work integrates the Keccak permutation as a custom instruction in a general-purpose CPU microarchitecture, a task complicated by the permutation's multistage computation and its memory access patterns. The design focuses on pipelined execution of the permutation, storage utilization, and hardware cost, and is evaluated with cycle-accurate GEM5 simulation and FPGA prototyping. Results show speedups of up to 8.02× over RISC-V optimized SHA-3 software and up to 46.31× over Keccak-specific software, at a cost of 15.09% more registers and 11.51% more LUTs. The work offers practical design guidance for native SHA-3 instruction support on RISC-V and for future cryptographic instruction set extensions.
📝 Abstract
Integrating cryptographic accelerators into modern CPU architectures presents unique microarchitectural challenges, particularly when extending instruction sets with complex, multistage operations. Hardware-assisted cryptographic instructions, such as Intel's AES-NI and ARM's custom instructions for encryption workloads, have demonstrated substantial performance improvements. However, efficient SHA-3 acceleration remains an open problem due to its distinct permutation-based structure and memory access patterns. Existing solutions primarily rely on standalone coprocessors or software optimizations, often avoiding the complexities of direct microarchitectural integration. This paper investigates the architectural challenges of embedding a SHA-3 permutation operation as a custom instruction within a general-purpose processor, focusing on pipelined simultaneous execution, storage utilization, and hardware cost, and prototypes such an instruction for the RISC-V CPU architecture. Using cycle-accurate GEM5 simulations and FPGA prototyping, we demonstrate performance improvements of up to 8.02x for RISC-V optimized SHA-3 software workloads and up to 46.31x for Keccak-specific software workloads, with only a 15.09% increase in registers and an 11.51% increase in LUT utilization. These findings provide critical insights into the feasibility and impact of SHA-3 acceleration at the microarchitectural level, highlighting practical design considerations for future cryptographic instruction set extensions.
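For readers unfamiliar with the operation being accelerated, the following is a minimal software sketch of the Keccak-f[1600] permutation and a SHA3-256 sponge built on it, following the structure defined in FIPS 202. It is not the paper's instruction, hardware design, or benchmark code; the function names (`keccak_f1600`, `permute_bytes`, `sha3_256`) are illustrative assumptions. It may help show why the permutation is multistage: each of the 24 rounds applies five dependent steps over a 1600-bit state.

```python
import hashlib

MASK64 = (1 << 64) - 1


def rol64(value, shift):
    """Rotate a 64-bit lane left by `shift` bits."""
    shift %= 64
    return ((value << shift) | (value >> (64 - shift))) & MASK64


def keccak_f1600(lanes):
    """Apply the 24-round permutation to a 5x5 array of 64-bit lanes, lanes[x][y]."""
    lfsr = 1  # generates the iota round-constant bits
    for _ in range(24):
        # theta: mix each lane with the parity of two neighbouring columns
        c = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4]
             for x in range(5)]
        d = [c[(x + 4) % 5] ^ rol64(c[(x + 1) % 5], 1) for x in range(5)]
        lanes = [[lanes[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        # rho and pi: rotate each lane and move it to a new position
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], rol64(current, (t + 1) * (t + 2) // 2)
        # chi: the only non-linear step, applied along each row
        for y in range(5):
            row = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # iota: XOR a round constant into lane (0, 0)
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    return lanes


def permute_bytes(state):
    """Run the permutation on a 200-byte state (little-endian lanes)."""
    lanes = [[int.from_bytes(state[8 * (x + 5 * y):8 * (x + 5 * y) + 8], "little")
              for y in range(5)] for x in range(5)]
    lanes = keccak_f1600(lanes)
    out = bytearray(200)
    for x in range(5):
        for y in range(5):
            out[8 * (x + 5 * y):8 * (x + 5 * y) + 8] = lanes[x][y].to_bytes(8, "little")
    return out


def sha3_256(message):
    """SHA3-256 sponge: 136-byte rate, domain suffix 0x06, 32-byte digest."""
    rate = 136
    padded = bytearray(message) + bytes(rate - len(message) % rate)
    padded[len(message)] ^= 0x06  # SHA-3 domain bits plus first padding bit
    padded[-1] ^= 0x80            # final padding bit
    state = bytearray(200)
    for offset in range(0, len(padded), rate):
        for i in range(rate):
            state[i] ^= padded[offset + i]
        state = permute_bytes(state)  # one permutation call per absorbed block
    return bytes(state[:32])


if __name__ == "__main__":
    digest = sha3_256(b"abc")
    print(digest.hex())
    print(digest == hashlib.sha3_256(b"abc").digest())
```

In this sketch every absorbed 136-byte block costs one full call to `keccak_f1600`, which is the portion of the workload a permutation instruction of the kind described in the abstract would be expected to replace.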