AI Summary
This work addresses the leakage of sensitive user data during Transformer inference in machine learning as a service (MLaaS). Existing solutions that combine fully homomorphic encryption (FHE) and secure multi-party computation (MPC) suffer from inefficient FHE kernels, high MPC communication overhead, and costly FHE-MPC conversions. To overcome these limitations, the authors propose EncFormer, a two-party framework for private Transformer inference. It introduces Stage Compatible Patterns so that FHE kernels compose efficiently, which reduces repacking and FHE-MPC switching. The authors also build a cost analysis model around a minimal-conversion baseline to guide the selection of FHE-MPC boundaries, and they design a secure complex CKKS-MPC conversion protocol together with communication-efficient MPC protocols for nonlinear operations. With GPU optimizations, EncFormer reduces online MPC communication by 1.4x-30.4x and end-to-end latency by 1.3x-9.8x relative to prior hybrid FHE-MPC systems on GPT- and BERT-style models. On BERT-base it achieves 1.9x-3.5x lower end-to-end latency than FHE-only pipelines under a matched backend, while maintaining near-plaintext accuracy on selected GLUE tasks.
Abstract
Transformer inference in machine-learning-as-a-service (MLaaS) raises privacy concerns for sensitive user inputs. Prior secure solutions that combine fully homomorphic encryption (FHE) and secure multiparty computation (MPC) are bottlenecked by inefficient FHE kernels, communication-heavy MPC protocols, and expensive FHE-MPC conversions. We present EncFormer, a two-party private Transformer inference framework that introduces Stage Compatible Patterns so that FHE kernels compose efficiently, reducing repacking and conversions. EncFormer also provides a cost analysis model built around a minimal-conversion baseline, enabling principled selection of FHE-MPC boundaries. To further reduce communication, EncFormer proposes a secure complex CKKS-MPC conversion protocol and designs communication-efficient MPC protocols for nonlinearities. With GPU optimizations, evaluations on GPT- and BERT-style models show that EncFormer achieves 1.4x-30.4x lower online MPC communication and 1.3x-9.8x lower end-to-end latency against prior hybrid FHE-MPC systems, and 1.9x-3.5x lower end-to-end latency on BERT-base than FHE-only pipelines under a matched backend, while maintaining near-plaintext accuracy on selected GLUE tasks.