🤖 AI Summary
Homomorphic encryption (HE) severely impedes language model decoding efficiency: existing argmax and sampling methods rely on non-polynomial operations, such as comparisons, that incur prohibitive computational overhead under encryption. Method: We propose CutMax, a polynomial-time, differentiable, and globally convergent algorithm for encrypted-domain argmax approximation, together with the first HE-compatible nucleus sampling framework. Both leverage polynomial approximations and gradient-driven, sequence-level optimization to enable secure, efficient decoding directly over ciphertexts. Contribution/Results: Evaluated on real large language model outputs, our approach reduces decoding latency by 24×–35× over state-of-the-art baselines, achieving practical performance for encrypted text generation for the first time and establishing a scalable, verifiable decoding paradigm for privacy-preserving inference under HE.
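For intuition, the sketch below gives a minimal plaintext implementation of one classic polynomial argmax scheme, iterative powering and sum-normalization, which has the same qualitative properties the summary names (polynomial, differentiable, converging to a near-one-hot two-level fixed point). The function name `poly_argmax`, the public logit bounds `lo`/`hi`, and the specific update rule are illustrative assumptions, not the paper's CutMax; under HE, the division would itself be replaced by a polynomial inverse approximation.

```python
import numpy as np

def poly_argmax(logits, lo=-20.0, hi=20.0, p=4, iters=6):
    """Illustrative polynomial, differentiable argmax approximation.

    Repeated powering plus sum-normalization drives the vector toward
    a two-level fixed point: ~1 at the maximizer, ~0 elsewhere.
    (Hypothetical sketch; not the paper's CutMax update rule.)
    """
    # Affine rescale into (0, 1] using assumed public bounds [lo, hi];
    # affine maps over ciphertexts are cheap under HE.
    x = (np.asarray(logits, dtype=np.float64) - lo) / (hi - lo)
    for _ in range(iters):
        x = x ** p       # polynomial sharpening: widens relative gaps
        x = x / x.sum()  # under HE, division via polynomial inversion
    return x

print(np.round(poly_argmax([1.2, 3.5, 0.7, 3.4]), 4))
# -> [0. 1. 0. 0.] (approximately): a near-one-hot selector
```

Because each iteration raises the ratio between the runner-up and the maximizer to the power p, after k iterations that ratio shrinks to roughly (x₂/x₁)^(p^k), which is why only a handful of iterations suffice.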
📝 Abstract
Large language models (LLMs) power modern AI applications, but processing sensitive data on untrusted servers raises privacy concerns. Homomorphic encryption (HE) enables computation on encrypted data for secure inference. However, neural text generation requires decoding methods like argmax and sampling, which are non-polynomial and thus computationally expensive under encryption, creating a significant performance bottleneck. We introduce cutmax, an HE-friendly argmax algorithm that reduces ciphertext operations compared to prior methods, enabling practical greedy decoding under encryption. We also propose the first HE-compatible nucleus (top-p) sampling method, leveraging cutmax for efficient stochastic decoding with provable privacy guarantees. Both techniques are polynomial, supporting efficient inference in privacy-preserving settings. Moreover, their differentiability facilitates gradient-based sequence-level optimization as a polynomial alternative to straight-through estimators. We further provide strong theoretical guarantees for cutmax, proving it converges globally to a unique two-level fixed point, independent of the input values beyond the identity of the maximizer, which explains its rapid convergence in just a few iterations. Evaluations on realistic LLM outputs show latency reductions of 24×–35× over baselines, advancing secure text generation.
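To make the stochastic-decoding claim concrete, a standard way to reduce sampling to argmax is the Gumbel-max trick: adding i.i.d. Gumbel noise to the logits and taking the argmax draws exactly from the softmax distribution. The sketch below combines that trick with the hypothetical `poly_argmax` above; it is an illustrative assumption rather than the paper's nucleus construction, which additionally restricts the draw to the smallest set of tokens whose cumulative probability mass exceeds p.

```python
def gumbel_poly_sample(logits, rng=None):
    """Sample a token as a near-one-hot vector using only an argmax.

    Gumbel-max trick: argmax(logits + Gumbel noise) ~ softmax(logits).
    In an HE setting the noise could be supplied encrypted by the client,
    keeping the server-side computation polynomial (an assumption here).
    """
    rng = np.random.default_rng() if rng is None else rng
    g = rng.gumbel(size=len(logits))  # i.i.d. Gumbel(0, 1) noise
    # Assumes logits + noise stay within poly_argmax's public bounds.
    return poly_argmax(np.asarray(logits) + g)
```

Repeated calls yield draws whose empirical frequencies match softmax(logits), while every server-side operation remains a polynomial over the ciphertexts.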