🤖 AI Summary
Fully Homomorphic Encryption (FHE) suffers from prohibitive computational and memory overheads, hindering efficient GPU deployment and limiting its practicality in cloud-based privacy-preserving computation. This paper introduces Cheddar, a high-performance, CUDA-optimized FHE library. The approach features: (i) a compact 32-bit ciphertext word representation and associated arithmetic design; (ii) algorithm–architecture co-optimization, including custom low-precision Number-Theoretic Transform (NTT) and Fast Fourier Transform (FFT) kernels and hierarchical FHE primitive implementations; and (iii) re-engineering of critical execution paths, particularly base conversion and ciphertext multiply-accumulate operations. Experimental evaluation on mainstream GPUs demonstrates a 2.9×–25.6× speedup over prior GPU-accelerated FHE libraries on representative FHE workloads. This substantially narrows the performance gap between FHE and plaintext computation, strengthening the infrastructure foundation for practical privacy-enhancing technologies.
📝 Abstract
Fully homomorphic encryption (FHE) is a cryptographic technology capable of resolving security and privacy problems in cloud computing by encrypting data in use. However, FHE introduces tremendous computational overhead for processing encrypted data, causing FHE workloads to run 2–6 orders of magnitude slower than their unencrypted counterparts. To mitigate this overhead, we propose Cheddar, an FHE library for CUDA GPUs that demonstrates significantly faster performance than prior GPU implementations. We develop optimized functionalities at various implementation levels, ranging from efficient low-level primitives to streamlined high-level operational sequences. In particular, we improve major FHE operations, including the number-theoretic transform and base conversion, through efficient kernel designs using a small word size of 32 bits. By these means, Cheddar demonstrates 2.9 to 25.6 times higher performance on representative FHE workloads compared to prior GPU implementations.
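To make the role of the number-theoretic transform concrete, below is a minimal Python sketch of NTT-based polynomial multiplication, the operation that dominates FHE ciphertext arithmetic. The prime `P = 998244353` (which fits in 32 bits) and its primitive root `G = 3` are standard illustrative choices, not Cheddar's actual parameters or kernel design; a real GPU implementation would fuse butterflies into CUDA kernels and use residue-number-system arithmetic across many such primes.

```python
# Illustrative iterative radix-2 NTT over a 32-bit prime.
# P = 998244353 = 119 * 2^23 + 1 supports transforms up to length 2^23;
# G = 3 is a primitive root mod P. These are example parameters only.
P = 998244353
G = 3

def ntt(a, invert=False):
    """In-place Cooley-Tukey NTT of a (length must be a power of two) mod P."""
    n = len(a)
    # Bit-reversal permutation so butterflies can run in natural order.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        # w is a primitive length-th root of unity mod P.
        w = pow(G, (P - 1) // length, P)
        if invert:
            w = pow(w, P - 2, P)  # modular inverse for the inverse transform
        for start in range(0, n, length):
            wn = 1
            for k in range(start, start + length // 2):
                u = a[k]
                v = a[k + length // 2] * wn % P
                a[k] = (u + v) % P
                a[k + length // 2] = (u - v) % P
                wn = wn * w % P
        length <<= 1
    if invert:
        n_inv = pow(n, P - 2, P)
        for i in range(n):
            a[i] = a[i] * n_inv % P
    return a

def poly_mul(f, g):
    """Multiply two integer polynomials (coefficient lists) mod P via NTT."""
    n = 1
    while n < len(f) + len(g) - 1:
        n <<= 1
    fa = list(f) + [0] * (n - len(f))
    ga = list(g) + [0] * (n - len(g))
    ntt(fa)
    ntt(ga)
    prod = [x * y % P for x, y in zip(fa, ga)]
    ntt(prod, invert=True)
    return prod[:len(f) + len(g) - 1]

# (1 + 2x + 3x^2)(4 + 5x + 6x^2) = 4 + 13x + 28x^2 + 27x^3 + 18x^4
print(poly_mul([1, 2, 3], [4, 5, 6]))
```

Keeping every coefficient within a 32-bit word, as sketched above, is what allows a GPU to pack more residues per register and memory transaction than a 64-bit representation would, which is the intuition behind the library's small-word-size design.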