📝 Abstract
Homomorphic Encryption (HE) enables secure computation on encrypted data, addressing privacy concerns in cloud computing. However, the high computational cost of HE operations, particularly matrix multiplication (MM), remains a major barrier to practical deployment. Accelerating HE MM is therefore crucial for applications such as privacy-preserving machine learning.
In this paper, we present a bandwidth-efficient FPGA implementation of HE MM. We first develop a cost model that evaluates the on-chip memory requirements for a given set of HE parameters and input matrix sizes. Our analysis shows that optimizing on-chip memory usage is critical for scalable and efficient HE MM. To this end, we design a novel datapath for Homomorphic Linear Transformation (HLT), the primary bottleneck in HE MM. The proposed datapath significantly reduces off-chip memory traffic and on-chip memory demand by enabling fine-grained data reuse. Leveraging this datapath, we introduce FAME, the first FPGA-based accelerator specifically tailored for HE MM. FAME supports arbitrary matrix shapes and is configurable across a wide range of HE parameter sets. We implement FAME on an Alveo U280 FPGA and evaluate it across diverse matrix sizes and shapes. Experimental results show that FAME achieves an average speedup of 221× over state-of-the-art CPU-based implementations, demonstrating its scalability and practicality for large-scale consecutive HE MM in real-world workloads.
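To make the HLT bottleneck concrete, the sketch below simulates in plaintext the standard rotation-and-diagonal form that homomorphic linear transformations take under packed (SIMD-slot) encodings such as CKKS: a matrix-vector product is decomposed into generalized diagonals, each multiplied slot-wise with a rotated copy of the packed vector. This is a generic illustration of the technique, not FAME's datapath; NumPy arrays stand in for ciphertexts, and `np.roll` stands in for a (much more expensive) ciphertext rotation.

```python
import numpy as np

def diagonal_hlt(matrix, vec):
    """Compute M @ v in the rotation-and-diagonal form used by
    homomorphic linear transformations (HLTs).

    Plaintext stand-ins: `vec` plays the role of a packed ciphertext,
    np.roll plays the role of a homomorphic slot rotation, and the
    element-wise product plays the role of a plaintext-ciphertext
    multiplication. Each of the n diagonals costs one rotation plus
    one slot-wise multiply-accumulate, which is why HLT dominates
    the cost of HE matrix multiplication.
    """
    n = len(vec)
    result = np.zeros(n)
    for i in range(n):
        # i-th generalized diagonal of M: diag_i[j] = M[j][(j + i) % n]
        diag = np.array([matrix[j][(j + i) % n] for j in range(n)])
        # Rotation by i slots: rotated[j] = vec[(j + i) % n]
        rotated = np.roll(vec, -i)
        # Slot-wise product, accumulated across diagonals
        result += diag * rotated
    return result

# Check against an ordinary matrix-vector product
M = np.arange(16, dtype=float).reshape(4, 4)
v = np.array([1.0, 2.0, 3.0, 4.0])
assert np.allclose(diagonal_hlt(M, v), M @ v)
```

Because every diagonal touches a freshly rotated copy of the whole packed vector, a naive hardware mapping re-reads the operand once per diagonal; reuse of the rotated data across diagonals is exactly the kind of opportunity a fine-grained HLT datapath can exploit to cut off-chip traffic.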