🤖 AI Summary
Homomorphic encryption (HE) faces practical deployment challenges in cloud environments due to prohibitively high computational overhead and the cost of domain-specific accelerators. To address this, this paper introduces the first HE compilation optimization framework tailored for AI accelerators—specifically, Google's TPUv4. The method bridges the semantic gap between HE arithmetic and AI hardware by rethinking HE operations through the lens of dense matrix computation. Key contributions include: (1) the first adaptation of HE modular multiplication and high-precision arithmetic to the matrix-centric execution model of AI chips; (2) three compilation mapping techniques—Barrett modular reduction, Basis Aligned Transformation (BAT), and Matrix Aligned Transformation (MAT). Implemented atop the CROSS compiler, the framework achieves up to 161× speedup over many-core CPUs and 5× over NVIDIA V100 GPUs for core HE operators on TPUv4. All optimized kernels are open-sourced.
📝 Abstract
Cloud-based services are making the outsourcing of sensitive client data increasingly common. Although homomorphic encryption (HE) offers strong privacy guarantees, it requires substantially more resources than computing on plaintext, often leading to unacceptably large latencies in obtaining results. HE accelerators have emerged to mitigate this latency issue, but at the high cost of ASICs. In this paper we show that HE primitives can be converted to AI operators and accelerated on existing ASIC AI accelerators, like TPUs, which are already widely deployed in the cloud. Adapting such accelerators for HE requires (1) supporting modular multiplication, (2) high-precision arithmetic in software, and (3) efficient mapping onto matrix engines. We introduce the CROSS compiler, which uses (1) Barrett reduction to provide modular reduction support using only multipliers and adders, (2) Basis Aligned Transformation (BAT) to convert high-precision multiplication into low-precision matrix-vector multiplication, and (3) Matrix Aligned Transformation (MAT) to convert vectorized modular operations with reduction into matrix multiplications that can be efficiently processed on a 2D spatial matrix engine. Our evaluation of CROSS on a Google TPUv4 demonstrates significant performance improvements, with up to 161x and 5x speedups compared to previous work on many-core CPUs and a V100 GPU, respectively. The kernel-level code is open-sourced at https://github.com/google/jaxite.git.
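The core idea behind BAT is that a wide multiplication can be decomposed into narrow limbs, and the limb-by-limb convolution is exactly a matrix-vector product, which a 2D matrix engine executes natively. A minimal NumPy sketch under assumed parameters (8-bit limbs, hypothetical helper names; the actual CROSS transformation targets TPU systolic arrays):

```python
import numpy as np

BASE_BITS = 8                     # illustrative low-precision limb width
BASE = 1 << BASE_BITS

def to_limbs(x, n):
    """Split x into n base-2^8 limbs, least-significant first."""
    return np.array([(x >> (BASE_BITS * i)) & (BASE - 1) for i in range(n)],
                    dtype=np.int64)

def limb_matmul(a, b, n):
    """High-precision multiply expressed as a matrix-vector product:
    the limbs of `a` fill a lower-triangular banded matrix A, so that
    A @ limbs(b) yields the (carry-deferred) limbs of a * b."""
    al, bl = to_limbs(a, n), to_limbs(b, n)
    A = np.zeros((2 * n, n), dtype=np.int64)
    for i in range(n):
        A[i:i + n, i] = al        # column i is `al` shifted down by i rows
    c = A @ bl                    # c[k] = sum_{i+j=k} a_i * b_j
    # Reconstruct the integer product; carries are absorbed here.
    return sum(int(ck) << (BASE_BITS * k) for k, ck in enumerate(c))
```

Deferring carries is what makes this mapping efficient: the matrix engine accumulates all partial products in wide accumulators, and carry propagation happens once at the end rather than inside the inner loop.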