Theodosian: A Deep Dive into Memory-Hierarchy-Centric FHE Acceleration

📅 2025-12-20
🤖 AI Summary
CKKS fully homomorphic encryption (FHE) on GPUs suffers from memory bandwidth bottlenecks and low hardware utilization, which the paper attributes to inefficient cache usage and insufficient intra-kernel parallelism. The paper characterizes these bottlenecks across the memory hierarchy of modern GPUs and proposes Theodosian, a memory-aware co-optimization framework that combines fine-grained CUDA cache optimization, kernel fusion, data layout restructuring, instruction-level pipeline reordering, and CKKS-specific algorithmic simplifications. On an RTX 5090, Theodosian reduces the bootstrapping latency for 32,768 complex numbers to 15.2 ms, and to 12.8 ms with the additional algorithmic optimizations, which the authors report as state-of-the-art GPU performance to the best of their knowledge.
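To make the kernel fusion item above concrete, the following toy Python sketch compares how many words travel to and from main memory when a chain of element-wise modular operations runs as separate passes versus one fused pass. It is an illustration only, not Theodosian's implementation: the modulus `Q`, the stage functions, and the helper names `run_unfused` and `run_fused` are hypothetical, and Python lists stand in for GPU global memory.

```python
from typing import Callable, List, Sequence, Tuple

Q = 97  # toy modulus (assumption); real CKKS moduli are far larger primes

Stage = Callable[[int], int]

# Hypothetical element-wise stages, each standing in for a separate GPU kernel.
STAGES: List[Stage] = [
    lambda x: (x + 5) % Q,
    lambda x: (x * 3) % Q,
    lambda x: (x * x) % Q,
]


def run_unfused(data: Sequence[int], stages: Sequence[Stage]) -> Tuple[List[int], int]:
    """One pass per stage: every stage reads a full array and writes a full array."""
    current = list(data)
    words_moved = 0
    for stage in stages:
        result = [stage(x) for x in current]
        words_moved += len(current) + len(result)  # one read + one write per element
        current = result
    return current, words_moved


def run_fused(data: Sequence[int], stages: Sequence[Stage]) -> Tuple[List[int], int]:
    """Single pass: intermediates stay in a local variable (a register on a GPU)."""
    out: List[int] = []
    for x in data:
        for stage in stages:
            x = stage(x)
        out.append(x)
    words_moved = 2 * len(data)  # each input read once, each output written once
    return out, words_moved


data = list(range(1024))
unfused_out, unfused_words = run_unfused(data, STAGES)
fused_out, fused_words = run_fused(data, STAGES)
```

Both versions produce the same output, but in this simplified model the fused pass moves one third of the words. For a chain of bandwidth-bound kernels, that reduction in memory traffic is where the benefit of fusion comes from.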

📝 Abstract
Fully homomorphic encryption (FHE) enables secure computation on encrypted data, mitigating privacy concerns in cloud and edge environments. However, its high compute and memory demands have motivated extensive acceleration research across diverse hardware platforms, especially GPUs. In this paper, we perform a microarchitectural analysis of CKKS, a popular FHE scheme, on modern GPUs. We focus on on-chip cache behavior and show that the dominant kernels remain bound by memory bandwidth despite a high-bandwidth L2 cache, exposing a persistent memory wall. We further find that overall CKKS pipeline throughput is constrained by low per-kernel hardware utilization, caused by insufficient intra-kernel parallelism. Motivated by these findings, we introduce Theodosian, a set of complementary, memory-aware optimizations that improve cache efficiency and reduce runtime overheads. Our approach delivers consistent speedups across various CKKS workloads. On an RTX 5090, we reduce the bootstrapping latency for 32,768 complex numbers to 15.2 ms with Theodosian, and further to 12.8 ms with additional algorithmic optimizations, which is, to the best of our knowledge, new state-of-the-art GPU performance.
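The claim that dominant kernels are bound by memory bandwidth can be pictured with a roofline-style check. The sketch below is a minimal illustration under stated assumptions, not the paper's analysis: the device figures are placeholders (not measured RTX 5090 numbers), and the ring degree, limb count, 8-byte word size, and 10 integer operations per modular multiplication are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Placeholder device figures (assumptions, not measured RTX 5090 numbers)."""
    peak_ops_per_s: float   # peak integer operations per second
    mem_bytes_per_s: float  # peak memory bandwidth in bytes per second

    @property
    def ridge_point(self) -> float:
        """Ops per byte above which a kernel becomes compute-bound."""
        return self.peak_ops_per_s / self.mem_bytes_per_s


@dataclass(frozen=True)
class Kernel:
    ops: float          # total arithmetic operations
    bytes_moved: float  # total bytes read from and written to memory

    @property
    def intensity(self) -> float:
        """Arithmetic intensity in ops per byte."""
        return self.ops / self.bytes_moved


def is_memory_bound(kernel: Kernel, device: Device) -> bool:
    return kernel.intensity < device.ridge_point


def attainable_ops_per_s(kernel: Kernel, device: Device) -> float:
    """Roofline bound: limited by either compute peak or bandwidth."""
    return min(device.peak_ops_per_s, kernel.intensity * device.mem_bytes_per_s)


def elementwise_modmul(n_coeffs: int, n_limbs: int,
                       word_bytes: int = 8, ops_per_modmul: int = 10) -> Kernel:
    """Element-wise modular multiply: two operand reads and one write per element."""
    elems = n_coeffs * n_limbs
    return Kernel(ops=elems * ops_per_modmul, bytes_moved=elems * 3 * word_bytes)


device = Device(peak_ops_per_s=20e12, mem_bytes_per_s=1e12)  # hypothetical
kernel = elementwise_modmul(n_coeffs=2**16, n_limbs=24)      # hypothetical sizes
memory_bound = is_memory_bound(kernel, device)
attainable = attainable_ops_per_s(kernel, device)
```

Under these assumed numbers the kernel performs well under one operation per byte while the device needs about 20 to saturate its compute units, so the attainable throughput is set by bandwidth. Serving more of those bytes from on-chip cache raises the effective bandwidth term, which is the lever that cache-focused optimizations pull.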
Problem

Research questions and friction points this paper is trying to address.

CKKS kernels on GPUs stay bound by memory bandwidth despite a high-bandwidth L2 cache
On-chip caches are used inefficiently
Low per-kernel hardware utilization, caused by insufficient intra-kernel parallelism, limits pipeline throughput
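As a rough illustration of the utilization problem listed above, the following Python sketch estimates how much of a GPU a single kernel launch keeps busy. It assumes a deliberately simplified model in which thread blocks execute in equal-length waves; the function name `wave_utilization`, the SM count, and the blocks-per-SM figure are hypothetical and do not come from the paper.

```python
import math


def wave_utilization(num_blocks: int, num_sms: int, blocks_per_sm: int = 1) -> float:
    """Fraction of block slots doing useful work, averaged over all waves.

    Simplified model (assumption): every block takes the same time, and the
    GPU runs blocks in waves of num_sms * blocks_per_sm concurrent slots.
    """
    slots = num_sms * blocks_per_sm
    if num_blocks <= 0 or slots <= 0:
        raise ValueError("num_blocks, num_sms and blocks_per_sm must be positive")
    waves = math.ceil(num_blocks / slots)
    return num_blocks / (waves * slots)


NUM_SMS = 100  # hypothetical SM count, chosen for round numbers

small_launch = wave_utilization(num_blocks=32, num_sms=NUM_SMS)   # fewer blocks than SMs
tail_launch = wave_utilization(num_blocks=250, num_sms=NUM_SMS)   # partially filled last wave
full_launch = wave_utilization(num_blocks=300, num_sms=NUM_SMS)   # exactly three full waves
```

In this model a kernel that launches fewer blocks than there are SMs leaves most of the device idle no matter how well each block is tuned, which is why exposing more parallelism inside a kernel matters alongside memory optimizations.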
Innovation

Methods, ideas, or system contributions that make the work stand out.

Memory-aware optimizations for cache efficiency
Reducing runtime overheads in CKKS pipeline
Improving intra-kernel parallelism on GPUs
Authors

Wonseok Choi, PhD Student, POSTECH (vision language model, model evaluation, computer vision)
Hyunah Yu, Seoul National University
Jongmin Kim, Seoul National University
Hyesung Ji, Seoul National University
Jaiyoung Park, Seoul National University
Jung Ho Ahn, Seoul National University (Computer Architecture)