🤖 AI Summary
This paper addresses two challenges in privacy-preserving Transformer inference: low efficiency and poor scalability to long sequences. It proposes CipherPrune, an efficient and scalable encrypted inference framework for Transformers. The method introduces three core components: (1) a layer-wise progressive encrypted token pruning protocol that dynamically removes redundant input tokens; (2) an adaptive encrypted polynomial degree-reduction protocol that lowers the approximation degree of nonlinear activations for less important tokens; and (3) a protocol-aware, gradient-based search that jointly optimizes pruning thresholds and degree-reduction conditions. The framework is compatible with both secure multi-party computation (MPC) and homomorphic encryption (HE). On inputs of 128 and 512 tokens, it achieves approximately 6.1× and 10.6× inference speedups over previous methods, respectively, with only a marginal drop in accuracy. The implementation is open-sourced.
📝 Abstract
Private Transformer inference using cryptographic protocols offers promising solutions for privacy-preserving machine learning; however, it still faces significant runtime overhead (efficiency issues) and challenges in handling long-token inputs (scalability issues). We observe that the Transformer's operational complexity scales quadratically with the number of input tokens, making it essential to reduce the input token length. Notably, each token varies in importance, and many inputs contain redundant tokens. Additionally, prior private inference methods that rely on high-degree polynomial approximations for non-linear activations are computationally expensive. Therefore, reducing the polynomial degree for less important tokens can significantly accelerate private inference. Building on these observations, we propose \textit{CipherPrune}, an efficient and scalable private inference framework that includes a secure encrypted token pruning protocol, a polynomial reduction protocol, and corresponding Transformer network optimizations. At the protocol level, encrypted token pruning adaptively removes unimportant tokens from encrypted inputs in a progressive, layer-wise manner. Additionally, encrypted polynomial reduction assigns lower-degree polynomials to less important tokens after pruning, enhancing efficiency without decryption. At the network level, we introduce protocol-aware network optimization via a gradient-based search to maximize pruning thresholds and polynomial reduction conditions while maintaining the desired accuracy. Our experiments demonstrate that CipherPrune reduces the execution overhead of private Transformer inference by approximately $6.1\times$ for 128-token inputs and $10.6\times$ for 512-token inputs, compared to previous methods, with only a marginal drop in accuracy. The code is publicly available at https://github.com/UCF-Lou-Lab-PET/cipher-prune-inference.
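The two protocol-level ideas in the abstract (drop unimportant tokens, then give the less important survivors a cheaper polynomial) can be illustrated with a minimal plaintext sketch. This is not the CipherPrune implementation: it operates on unencrypted values, and several pieces are assumptions made only for illustration. In particular, the importance score (mean attention a token receives), the function names `token_importance`, `prune_and_assign`, and `attention_cost`, and the degree values `HIGH_DEGREE` and `LOW_DEGREE` are hypothetical and do not come from the source.

```python
from typing import Dict, List, Tuple

# Hypothetical placeholder degrees; the source does not state the actual values.
HIGH_DEGREE = 6
LOW_DEGREE = 2

Matrix = List[List[float]]


def token_importance(attention_heads: List[Matrix]) -> List[float]:
    """Score each token by the mean attention it receives.

    Assumption: importance is the attention weight a token receives as a key,
    averaged over heads and queries. The abstract does not define the score.
    """
    num_heads = len(attention_heads)
    num_tokens = len(attention_heads[0])
    scores = [0.0] * num_tokens
    for head in attention_heads:
        for query_row in head:
            for key_index, weight in enumerate(query_row):
                scores[key_index] += weight
    return [s / (num_heads * num_tokens) for s in scores]


def prune_and_assign(
    scores: List[float], prune_threshold: float, reduce_threshold: float
) -> Tuple[List[int], Dict[int, int]]:
    """Drop tokens below prune_threshold, then pick a polynomial degree
    for each surviving token based on reduce_threshold."""
    kept = [i for i, s in enumerate(scores) if s >= prune_threshold]
    degrees = {
        i: HIGH_DEGREE if scores[i] >= reduce_threshold else LOW_DEGREE
        for i in kept
    }
    return kept, degrees


def attention_cost(num_tokens: int) -> int:
    """Number of pairwise token interactions; grows quadratically."""
    return num_tokens * num_tokens
```

Applying `prune_and_assign` once per layer, with the surviving tokens of one layer feeding the next, gives the progressive, layer-wise behavior the abstract describes. Because `attention_cost` is quadratic, pruning a 4-token input down to 3 tokens already cuts the pairwise interactions from 16 to 9. In the actual framework these comparisons would be carried out on encrypted inputs, which this sketch does not attempt to model.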