🤖 AI Summary
To address the prohibitively high computational overhead of homomorphic encryption (HE) for large language model (LLM) inference in privacy-sensitive settings, this paper proposes an HE-friendly LLM architecture that integrates LoRA fine-tuning with Gaussian kernel approximation. The method introduces HE-friendly attention and feed-forward network (FFN) designs to enable end-to-end secure inference after private fine-tuning. Its core innovation lies in embedding LoRA adapters into a Gaussian-kernel-approximated transformer, drastically reducing polynomial evaluation complexity under HE. Experiments demonstrate a 6.94x speedup in fine-tuning, a 2.3x acceleration in HE-based inference, and performance comparable to plaintext baselines. The implementation is publicly available.
📝 Abstract
Large language models (LLMs) offer personalized responses based on user interactions, but this use case raises serious privacy concerns. Homomorphic encryption (HE) is a cryptographic scheme that supports arithmetic computation on encrypted data, making it a potential solution for privacy-preserving machine learning (PPML). However, the computational intensity of transformers poses challenges for applying HE to LLMs. In this work, we propose a modified HE-friendly transformer architecture with an emphasis on inference following personalized (private) fine-tuning. Using LoRA fine-tuning and Gaussian kernels, we achieve significant computational speedups (6.94x for fine-tuning and 2.3x for inference) while maintaining performance comparable to plaintext models. Our findings provide a viable proof of concept for privacy-preserving LLM services in domains where data protection is crucial. Our code is available on GitHub.
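The two ingredients the abstract names can be illustrated with a minimal NumPy sketch (the function names, shapes, and the bandwidth parameter `sigma` below are illustrative assumptions, not the paper's actual API). Gaussian-kernel attention replaces softmax with an unnormalized kernel, avoiding the division and max operations that are expensive under HE, while a LoRA layer adds a trainable low-rank update to a frozen base weight so that private fine-tuning only touches the small adapter matrices:

```python
import numpy as np

def gaussian_kernel_attention(Q, K, V, sigma=1.0):
    """Softmax-free attention: weights come from an unnormalized Gaussian kernel.

    Under HE, exp(-x) on a bounded range can be approximated by a low-degree
    polynomial, and no division (softmax normalization) is required.
    """
    # Pairwise squared distances ||q_i - k_j||^2, shape (n_queries, n_keys).
    d2 = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / (2.0 * sigma**2))  # kernel weights in (0, 1]
    return W @ V

def lora_forward(x, W0, A, B, alpha=1.0):
    """LoRA: y = x @ (W0 + alpha * A @ B), with W0 frozen and only A, B trained."""
    return x @ W0 + alpha * (x @ A) @ B
```

Because only the low-rank factors `A` and `B` are updated, personalized fine-tuning under encryption involves far fewer ciphertext operations than full fine-tuning, which is consistent with the speedups the abstract reports.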