GPU Acceleration of TFHE-Based High-Precision Nonlinear Layers for Encrypted LLM Inference

📅 2026-04-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency of existing fully homomorphic encryption (FHE) schemes in evaluating high-precision nonlinear functions, which hinders privacy-preserving inference for large language models. To overcome this limitation, we propose TIGER, a framework that delivers the first GPU-accelerated implementation of high-precision nonlinear layers in TFHE, breaking through the native lookup-table precision barrier. By integrating GPU-optimized WoP-PBS, advanced numerical approximation algorithms, and a batch-driven architecture, TIGER efficiently supports critical operations such as GELU, Softmax, and LayerNorm. Compared to CPU baselines, it delivers speedups of 7.17×, 16.68×, and 17.05× for these operations, substantially enhancing the practicality of encrypted inference.
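The summary above mentions decomposing operations like Softmax into pieces TFHE handles natively. The paper does not spell out TIGER's pipeline here, but as a rough illustration of the idea, Softmax can be broken into per-element table lookups (the PBS primitive) plus additions and multiplications; the sketch below simulates this in plaintext (all names and the specific decomposition are illustrative, not taken from the paper):

```python
import math

def softmax_lut_style(xs):
    """Plaintext sketch of a PBS-friendly Softmax decomposition.

    Each per-element nonlinear step (max via comparisons, exp, reciprocal)
    would be a table lookup (PBS) on ciphertexts; sums and products map to
    cheap homomorphic additions and multiplications.
    """
    m = max(xs)                            # max: comparisons via lookups in TFHE
    exps = [math.exp(x - m) for x in xs]   # exp: one lookup per element
    s = sum(exps)                          # additions are cheap homomorphically
    inv = 1.0 / s                          # reciprocal: one more lookup
    return [e * inv for e in exps]         # final scaling

print(softmax_lut_style([1.0, 2.0, 3.0]))
```

The batch-driven design the summary describes would amortize these lookups by evaluating many independent inputs' PBS operations concurrently on the GPU.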
📝 Abstract
Deploying large language models (LLMs) as cloud services raises privacy concerns, as inference may leak sensitive data. Fully Homomorphic Encryption (FHE) allows computation on encrypted data, but current FHE methods struggle with efficient and precise nonlinear function evaluation. Specifically, CKKS-based approaches require high-degree polynomial approximations, whose cost grows sharply as target precision increases. Alternatively, TFHE's Programmable Bootstrapping (PBS) outperforms CKKS by offering exact lookup-table evaluation; however, it lacks high-precision implementations of LLM nonlinear layers and underutilizes GPU resources. We propose TIGER, the first GPU-accelerated framework for high-precision TFHE-based nonlinear LLM layer evaluation. TIGER offers: (1) a GPU-optimized WoP-PBS method combined with numerical algorithms to surpass native lookup-table precision limits on nonlinear functions; (2) high-precision and efficient implementations of key nonlinear layers, enabling practical encrypted inference; (3) a batch-driven design exploiting inter-input parallelism to boost GPU efficiency. TIGER achieves 7.17×, 16.68×, and 17.05× speedups over a CPU baseline for GELU, Softmax, and LayerNorm, respectively.
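The "native lookup-table precision limit" the abstract refers to can be illustrated in plaintext: a single PBS evaluates a function through a table whose size is bounded by the message precision (commonly around 8 bits in TFHE), so the approximation error of a nonlinear function like GELU is floored by the table resolution. The sketch below simulates this; the interval, bit widths, and helper names are illustrative assumptions, not details from the paper:

```python
import math

def gelu(x: float) -> float:
    # Exact GELU in its erf form.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def lut_eval(f, x: float, bits: int, lo: float = -4.0, hi: float = 4.0) -> float:
    """Evaluate f via a 2**bits-entry lookup table over [lo, hi].

    Mimics a single PBS whose output precision is capped by the table size:
    the input is quantized to a bucket and the table stores f at bucket centers.
    """
    n = 1 << bits
    x = min(max(x, lo), hi)                       # clamp to the table's domain
    idx = min(int((x - lo) / (hi - lo) * n), n - 1)
    center = lo + (idx + 0.5) * (hi - lo) / n
    return f(center)

# Error shrinks as the table grows, but native PBS caps usable bits, which is
# what motivates WoP-PBS-style decomposition into multiple lower-precision lookups.
for bits in (4, 8, 12):
    err = max(abs(lut_eval(gelu, x / 100, bits) - gelu(x / 100))
              for x in range(-400, 401))
    print(bits, round(err, 5))
```

Running this shows the worst-case error dropping by roughly the table-growth factor at each step, which is why pushing past the native bit width (as TIGER's WoP-PBS path does) directly buys precision.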
Problem

Research questions and friction points this paper is trying to address.

Fully Homomorphic Encryption
TFHE
Nonlinear Layers
Encrypted LLM Inference
GPU Acceleration
Innovation

Methods, ideas, or system contributions that make the work stand out.

TFHE
GPU acceleration
Programmable Bootstrapping
encrypted LLM inference
high-precision nonlinear functions