Privacy-Preserving Inference for Quantized BERT Models

📅 2025-08-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the efficiency bottleneck of secure inference for generative models in privacy-sensitive settings, this paper proposes a fine-grained hierarchical quantization framework that integrates a multi-input lookup table (LUT) protocol with dual secret sharing. The method enables efficient integer-only quantized inference by introducing a 1-bit weight fully connected layer and a LUT-based secure softmax computation, thereby eliminating truncation overhead from nonlinear function evaluation. Through hierarchical quantization, low-precision integer arithmetic, and co-optimization of cryptographic protocols, the approach significantly reduces both communication and computational costs. Experiments on BERT-base demonstrate that our method achieves 8×, 9×, and 22× speedups over Lu et al. (NDSS ’25), Gupta et al. (PETS ’24), and Knott et al. (NeurIPS ’21), respectively. This work establishes a scalable new paradigm for high-assurance private inference of generative AI models.
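The summary above mentions integer-only quantized inference with 1-bit weight fully connected layers. As a hypothetical illustration (not the paper's code), the following sketch shows layer-wise symmetric quantization and a fully connected layer whose binarized weights reduce the matrix multiply to sign-conditioned additions; all names and the scale-folding convention here are assumptions:

```python
# Hypothetical sketch: layer-wise symmetric quantization and a
# 1-bit-weight fully connected layer, in plain Python.

def quantize(x, scale, bits=8):
    """Map a float to a clipped signed integer of the given bit width."""
    q = round(x / scale)
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, q))

def binarize(w):
    """1-bit weight: keep only the sign, so multiplying becomes +/-."""
    return 1 if w >= 0 else -1

def fc_1bit(x_q, w_rows, w_scale, x_scale):
    """Integer-only FC layer: the accumulation needs no multiplications,
    only sign-conditioned additions; scales are folded in at the end."""
    out = []
    for row in w_rows:
        acc = 0
        for xq, w in zip(x_q, row):
            acc += xq if binarize(w) > 0 else -xq
        out.append(acc * w_scale * x_scale)  # dequantize the accumulator
    return out
```

In a secure setting the integer accumulator would be computed on secret shares; the point of the 1-bit weights is that the inner loop never multiplies two secret values.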

📝 Abstract
With the increasing deployment of generative machine learning models in privacy-sensitive domains such as healthcare and personalized services, ensuring secure inference has become a critical challenge. Secure multi-party computation (MPC) enables privacy-preserving model inference but suffers from high communication and computation overhead. The main bottleneck lies in the expensive secure evaluation of floating-point operations. Quantization offers a promising solution by converting floating-point operations into lower-precision integer computations, significantly reducing overhead. However, existing MPC-based quantized inference methods either rely on public quantization parameters (posing privacy risks) or suffer from inefficiencies, particularly in handling nonlinear functions such as activations and softmax. In this work, we propose a fine-grained, layer-wise quantization scheme and support 1-bit weight fully connected layers in a secure setting. We design a multi-input lookup table protocol to evaluate softmax efficiently and securely. Furthermore, we use dual secret sharing schemes and perform precision conversions via lookup tables, eliminating truncation overhead entirely. Experimental evaluation on BERT-base models demonstrates that our approach achieves up to 8× speedup compared to Lu et al. (NDSS '25), 9× speedup compared to Gupta et al. (PETS '24), and 22× speedup compared to Knott et al. (NeurIPS '21).
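The abstract builds on additive secret sharing, the basic MPC primitive in which each party's share alone reveals nothing about the value. A minimal two-party sketch over the ring Z_2^32 (illustrative names; the paper's actual dual-sharing protocol is more involved):

```python
# Minimal sketch of 2-party additive secret sharing over a ring
# (assumptions: ring Z_2^32, semi-honest parties).
import random

RING = 1 << 32

def share(x):
    """Split x into two additive shares; each share alone is uniform."""
    r = random.randrange(RING)
    return r, (x - r) % RING

def reconstruct(s0, s1):
    """Recombine the two shares to recover the secret."""
    return (s0 + s1) % RING

def add_shares(a, b):
    """Secure addition is local: each party adds its own shares."""
    return (a[0] + b[0]) % RING, (a[1] + b[1]) % RING
```

Linear operations like the addition above are free of communication; it is the nonlinear functions (softmax, activations) that force expensive protocols, which is why the paper replaces them with lookup tables.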
Problem

Research questions and friction points this paper is trying to address.

Secure inference for quantized BERT models in privacy-sensitive domains
Reducing computation overhead in MPC-based quantized inference methods
Efficiently handling nonlinear functions like softmax in secure settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained layer-wise quantization scheme
Multi-input lookup table for softmax
Dual secret sharing for precision conversion
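As a cleartext analogue of the LUT-based softmax listed above (the paper evaluates such tables under secret sharing; the MPC protocol itself is omitted here, and all names are illustrative):

```python
# Hypothetical cleartext analogue of lookup-table softmax over
# quantized logits.
import math

def build_exp_lut(scale, bits=8):
    """Precompute exp(q * scale) for every signed q in the input range."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return {q: math.exp(q * scale) for q in range(lo, hi + 1)}

def lut_softmax(q_logits, lut):
    """Softmax evaluated purely by table lookups plus one normalization.
    Subtracting the max keeps every lookup inside the table's range."""
    m = max(q_logits)
    exps = [lut[q - m] for q in q_logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the exponentials come from a finite table indexed by small integers, a secure version can fetch them obliviously without evaluating any floating-point function inside the protocol.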
Tianpei Lu
The State Key Laboratory of Blockchain and Data Security, Zhejiang University, China
Bingsheng Zhang
Zhejiang University, IOHK
Data Security · Cryptography · Multi-party computation · Blockchain · Zero-knowledge proof
Lekun Peng
The State Key Laboratory of Blockchain and Data Security, Zhejiang University, China
Bowen Zheng
The State Key Laboratory of Blockchain and Data Security, Zhejiang University, China
Lichun Li
Ant Group, China
Kui Ren
Professor and Dean of Computer Science, Zhejiang University, ACM/IEEE Fellow
Data Security & Privacy · AI Security · IoT & Vehicular Security