🤖 AI Summary
To address the low efficiency of synergistic homomorphic encryption (HE) and secure multi-party computation (MPC) in private Transformer inference—and particularly the high communication overhead incurred during cross-protocol conversions—this paper proposes BLB, a novel hybrid privacy-preserving inference framework. BLB introduces the first secure, low-overhead conversion protocol between the CKKS HE scheme and MPC. It decomposes Transformer layers at a fine-grained level and fuses linear operators to minimize interaction rounds and data transmission volume. Furthermore, it establishes a hybrid computation paradigm optimized for efficient encrypted matrix multiplication and Softmax evaluation. Experiments on BERT and GPT-2 demonstrate that BLB reduces communication overhead by 21× and latency by 13× compared to BOLT, and achieves 2× lower communication and 1.8× lower latency than Bumblebee. These results mark a significant advancement in the efficiency of privacy-preserving large-language-model inference.
📝 Abstract
This paper presents an efficient framework for private Transformer inference that combines Homomorphic Encryption (HE) and Secure Multi-party Computation (MPC) to protect data privacy. Existing methods often leverage HE for linear layers (e.g., matrix multiplications) and MPC for non-linear layers (e.g., Softmax activation functions), but the conversion between HE and MPC introduces significant communication costs. The proposed framework, dubbed BLB, overcomes this by breaking down layers into fine-grained operators and further fusing adjacent linear operators, reducing the need for HE/MPC conversions. To manage the increased ciphertext bit width from the fused linear operators, BLB proposes the first secure conversion protocol between CKKS and MPC and enables CKKS-based computation of the fused operators. Additionally, BLB proposes an efficient matrix multiplication protocol for fused computation in Transformers. Extensive evaluations on BERT-base, BERT-large, and GPT2-base show that BLB achieves a $21\times$ reduction in communication overhead compared to BOLT (S&P'24) and a $2\times$ reduction compared to Bumblebee (NDSS'25), along with latency reductions of $13\times$ and $1.8\times$, respectively, when leveraging GPU acceleration.
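To make the fusion idea concrete, here is a minimal plaintext sketch (not BLB's actual protocol, and the matrix names are illustrative): two adjacent linear operators $W_2(W_1 x + b_1) + b_2$ algebraically collapse into a single linear operator $(W_2 W_1)x + (W_2 b_1 + b_2)$, so an HE-based evaluation would need one encrypted matrix multiplication and one HE/MPC conversion instead of two.

```python
import numpy as np

# Conceptual plaintext sketch of fusing two adjacent linear operators.
# In a hybrid HE/MPC pipeline, each unfused linear op would incur its
# own encrypted evaluation and a protocol conversion in between;
# the fused form needs only one. W1, b1, W2, b2 are illustrative names.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W1, b1 = rng.standard_normal((4, 4)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((4, 4)), rng.standard_normal(4)

# Unfused: two sequential linear operators.
y_unfused = W2 @ (W1 @ x + b1) + b2

# Fused: a single equivalent operator, precomputed once.
W_fused = W2 @ W1
b_fused = W2 @ b1 + b2
y_fused = W_fused @ x + b_fused

assert np.allclose(y_unfused, y_fused)
```

Note that fusion widens the dynamic range of intermediate values, which is why the abstract mentions managing the increased ciphertext bit width via CKKS.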