AI Summary
The core bottleneck of the FALCON post-quantum signature scheme lies in its computationally intensive discrete Gaussian sampling (DGS). To address this, we propose Bi-SamplerZ, a fully hardware-implemented dual-path DGS architecture. Bi-SamplerZ introduces a cooperative dual-datapath design that exploits both the paired invocation pattern inherent to SamplerZ and the coefficient correlation and properties of rejection sampling, so that the two datapaths assist each other rather than simply duplicating the sampling process. It achieves high efficiency through fine-grained pipelining and efficient control coordination, and is implemented on both ASIC and FPGA. Compared to state-of-the-art fully hardware implementations, Bi-SamplerZ reduces the sampling cycle count by 54.1% and achieves the best-known area-time product (ATP). It delivers the lowest sampling latency among existing designs, establishing a new hardware acceleration benchmark for FALCON.
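To make the ATP claim concrete, the following is a minimal arithmetic sketch, not taken from the paper. It assumes ATP is area multiplied by latency and that the clock frequency is the same before and after, so that latency scales with cycle count. The function names and the area factor of 1.5 are hypothetical; only the 54.1% figure comes from the text.

```python
def atp_ratio(cycle_reduction: float, area_factor: float) -> float:
    """Return (new ATP) / (baseline ATP).

    Assumes ATP = area * cycles at an unchanged clock frequency.
    cycle_reduction: fractional reduction in cycles (0.541 for 54.1%).
    area_factor: new area divided by baseline area (hypothetical input).
    """
    return area_factor * (1.0 - cycle_reduction)


def break_even_area_factor(cycle_reduction: float) -> float:
    """Largest area growth factor for which ATP does not get worse."""
    return 1.0 / (1.0 - cycle_reduction)


CYCLE_REDUCTION = 0.541        # figure quoted in the text
ASSUMED_AREA_FACTOR = 1.5      # hypothetical, for illustration only

print("break-even area factor:", break_even_area_factor(CYCLE_REDUCTION))
print("ATP ratio at assumed area:", atp_ratio(CYCLE_REDUCTION, ASSUMED_AREA_FACTOR))
```

Under these assumptions, a 54.1% cycle reduction improves ATP as long as the area grows by less than roughly 2.18 times, which is consistent with a better ATP despite a moderate increase in resources.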
Abstract
FALCON is a standardized quantum-resistant digital signature scheme that offers advantages over other schemes, but features a more complex signature generation process. This paper presents Bi-SamplerZ, a fully hardware-implemented, high-efficiency dual-path discrete Gaussian sampler designed to accelerate FALCON signature generation. Observing that the SamplerZ subroutine is consistently invoked in pairs during each signature generation, we propose a dual-datapath architecture capable of generating two sampling results simultaneously. To make the best use of coefficient correlation and the inherent properties of rejection sampling, we introduce an assistance mechanism that enables effective collaboration between the two datapaths, rather than simply duplicating the sampling process. Additionally, we incorporate several architectural optimizations over existing designs to further enhance speed, area efficiency, and resource utilization. Experimental results demonstrate that Bi-SamplerZ achieves the lowest sampling latency to date among existing designs, benefiting from fine-grained pipeline optimization and efficient control coordination. Compared with state-of-the-art fully hardware implementations, Bi-SamplerZ reduces the sampling cycle count by 54.1% while incurring only a moderate increase in hardware resource consumption, thereby achieving the best-known area-time product (ATP) for fully hardware-based sampler designs. In addition, to facilitate comparison with existing works, we provide both ASIC and FPGA implementations. Together, these results highlight the suitability of Bi-SamplerZ as a high-performance sampling engine in standardized post-quantum cryptographic systems such as FALCON.
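The paired-invocation observation can be illustrated with a small software model. This is a sketch under stated assumptions, not the paper's architecture and not FALCON's actual SamplerZ (which uses a table-based base sampler and a Bernoulli-exponential test). Here a generic rejection sampler draws from a truncated discrete Gaussian, and `sample_pair` compares the trial count of two back-to-back calls with that of two idealized independent datapaths running in parallel. The names `sample_z`, `sample_pair`, and the `tail` parameter are hypothetical, and the assistance mechanism between datapaths is not modeled.

```python
import math
import random


def sample_z(mu, sigma, rng, tail=10):
    """Toy rejection sampler for a discrete Gaussian centered at mu.

    Proposes integers uniformly in [mu - tail*sigma, mu + tail*sigma] and
    accepts z with probability exp(-(z - mu)^2 / (2 sigma^2)).
    Returns (sample, number_of_trials).
    """
    lo = math.floor(mu - tail * sigma)
    hi = math.ceil(mu + tail * sigma)
    trials = 0
    while True:
        trials += 1
        z = rng.randint(lo, hi)
        accept_prob = math.exp(-((z - mu) ** 2) / (2.0 * sigma ** 2))
        if rng.random() < accept_prob:
            return z, trials


def sample_pair(mu0, mu1, sigma, rng):
    """Model one paired invocation.

    Returns ((z0, z1), sequential_trials, parallel_trials), where
    sequential_trials is the cost of running the two calls back to back and
    parallel_trials is the cost of two independent datapaths (the slower one).
    """
    z0, t0 = sample_z(mu0, sigma, rng)
    z1, t1 = sample_z(mu1, sigma, rng)
    return (z0, z1), t0 + t1, max(t0, t1)


rng = random.Random(0)  # fixed seed so the run is repeatable
pair, seq_trials, par_trials = sample_pair(0.3, -1.7, 1.5, rng)
print(pair, seq_trials, par_trials)
```

In this model the parallel cost is bounded by the slower of the two datapaths, so one path often sits idle while the other keeps retrying. That idle time is the kind of slack a cooperative scheme between datapaths could try to reclaim.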