🤖 AI Summary
Modular arithmetic, which is critical for testing and generating large primes in cryptography, suffers from high latency, power consumption, and area overhead. To address this, the paper proposes a low-latency, synthesizable hardware architecture dedicated to modular reduction. The method combines carry-save addition (CSA) with a deeply optimized modular reduction path, yielding what is described as the first hardware modular unit with *O*(1) timing complexity. The design relies on combinational-logic CSA, segmented modular reduction, and custom RTL-level circuit design. Implemented in 65 nm CMOS technology, it reduces modular operation latency by 63% and area by 41% relative to the baseline. As a result, it significantly accelerates testing and generation of thousand-bit primes and provides an efficient hardware foundation for cryptographic acceleration on resource-constrained platforms.
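
The two building blocks named in the summary, carry-save addition and segmented modular reduction, can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the paper's circuit: the function names `carry_save_add` and `segmented_mod`, the `seg_bits` parameter, and the idea of folding each high segment through a precomputed residue are all hypothetical choices made for illustration. The loop is also sequential, so it says nothing about the *O*(1) timing or the latency and area figures reported above.

```python
def carry_save_add(a, b, c):
    """3:2 compressor: turn three operands into a (sum, carry) pair.

    Each output bit depends only on the three input bits in the same
    position, so no carry ripples across the word.
    """
    partial_sum = a ^ b ^ c                      # bitwise sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1   # majority bits, shifted left
    return partial_sum, carry


def segmented_mod(x, modulus, seg_bits=8):
    """Hypothetical segmented reduction of a non-negative x modulo modulus.

    Assumption: each high segment of x is replaced by its residue
    (segment * 2**shift) mod modulus, which hardware could read from a
    small lookup table. Residues are accumulated in carry-save form.
    """
    width = modulus.bit_length()
    acc_sum = x & ((1 << width) - 1)   # low part of x, kept as-is
    acc_carry = 0
    high = x >> width
    shift = width
    while high:
        segment = high & ((1 << seg_bits) - 1)
        # Software stand-in for the assumed table lookup.
        residue = (segment << shift) % modulus
        acc_sum, acc_carry = carry_save_add(acc_sum, acc_carry, residue)
        high >>= seg_bits
        shift += seg_bits
    total = acc_sum + acc_carry        # single carry-propagate addition
    while total >= modulus:            # final correction by subtraction
        total -= modulus
    return total


if __name__ == "__main__":
    s, c = carry_save_add(5, 6, 7)
    print(s + c == 5 + 6 + 7)
    print(segmented_mod(2**200 + 12345, 1000003) == (2**200 + 12345) % 1000003)
```

The design point this sketch tries to capture is that the accumulator stays in redundant (sum, carry) form until the very end, so only one full carry-propagating addition is needed regardless of how many segments are folded in.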