🤖 AI Summary
This work addresses the inefficiency of existing secure inference systems in untrusted cloud environments, where the quadratic latency and memory growth of autoregressive generation makes privacy protection for both user prompts and model parameters impractical. To overcome this bottleneck, the authors propose CryptoGen, the first system enabling secure reuse and update of encrypted key-value (KV) caches. By integrating homomorphic encryption with secret sharing, CryptoGen introduces a unified encrypted KV-cache framework enhanced with heterogeneous SIMD encodings, optimized ciphertext matrix operations, and novel noise-refresh and ciphertext-concatenation mechanisms. This design achieves near-linear scaling in privacy-preserving Transformer-based generation. Experimental results on WikiText-2, PTB, and LAMBADA demonstrate that CryptoGen reduces per-token latency by 4.4–7.6× compared to state-of-the-art approaches, with both latency and memory consumption growing nearly linearly with sequence length.
📝 Abstract
The widespread deployment of cloud-hosted generative models raises a fundamental challenge: enabling efficient autoregressive generation while preserving the privacy of both user prompts and model parameters in untrusted environments. We address this challenge in a client-server setting where an untrusted server hosts an autoregressive Transformer and the client requires cryptographic protection for both its inputs and the inference process. Because they lack native support for encrypted KV caches, secure inference systems designed for discriminative tasks incur quadratic latency and memory growth when adapted to autoregressive decoding. We present CryptoGen, the first system to enable scalable privacy-preserving neural generation with persistent encrypted key-value (KV) cache reuse: by securely reusing and updating encrypted KV caches throughout generation, CryptoGen achieves near-linear scaling. CryptoGen integrates homomorphic encryption and secret sharing to support both the prefilling and generation phases. Key techniques include a unified encrypted KV-cache framework, heterogeneous SIMD encodings for the different phases, optimized ciphertext-ciphertext matrix-matrix and matrix-vector operations, and efficient noise-refresh and ciphertext-concatenation mechanisms. Evaluation on generative Transformer models trained on WikiText-2, PTB, and LAMBADA shows that for input lengths of 128–512 tokens, CryptoGen achieves 4.4–7.6× lower per-token latency than state-of-the-art discriminative secure inference systems, while maintaining near-linear latency and memory scaling, with the advantage increasing for longer sequences. CryptoGen is released as an open-source library.
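To make the scaling argument concrete, the sketch below shows the plaintext KV-cache reuse pattern that the abstract describes: each decoding step appends one new key/value row instead of recomputing keys and values for the entire prefix, which is what turns quadratic total cost into near-linear cost. This is only an illustration of the caching pattern; CryptoGen performs the append and the attention arithmetic on ciphertexts (its ciphertext-concatenation and encrypted matrix-vector operations), none of which is shown here, and all function and variable names are our own.

```python
import numpy as np

def attention(q, K, V):
    # Single-head scaled dot-product attention.
    # q: (d,), K and V: (t, d); returns a (d,) context vector.
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

def generate_with_cache(x0, steps, Wq, Wk, Wv, d):
    # Persistent KV cache: one new key/value row is appended per step,
    # so step t costs O(t) work instead of the O(t^2) recomputation
    # incurred by systems without (encrypted) KV-cache support.
    K = np.empty((0, d))
    V = np.empty((0, d))
    x = x0
    for _ in range(steps):
        K = np.vstack([K, x @ Wk])  # CryptoGen performs this append on
        V = np.vstack([V, x @ Wv])  # ciphertexts instead of plaintext rows
        x = attention(x @ Wq, K, V)
    return x, K.shape[0]
```

Under encryption the append step is nontrivial (it is the ciphertext-concatenation mechanism the abstract lists), and repeated reuse of the same cache is what motivates the noise-refresh mechanism: homomorphic operations accumulate noise, so a long-lived encrypted cache must be refreshed to stay decryptable.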