🤖 AI Summary
This work addresses the privacy risks inherent in large language model (LLM) inference, where user queries must be transmitted to remote servers, potentially exposing sensitive data. Existing privacy-preserving approaches often incur high communication overhead and rely on locally available models, limiting their practicality. To overcome these challenges, the authors propose DEL, a novel framework that leverages embedding projection combined with differentially private stochastic quantization to substantially reduce communication costs. DEL uniquely introduces a soft prompt mechanism deployed on the server side to mitigate the utility degradation caused by privacy-preserving perturbations. Notably, this approach enables efficient, privacy-preserving split inference without requiring a local model. Experiments on text generation and natural language understanding tasks demonstrate that DEL achieves strong differential privacy guarantees while maintaining low communication overhead and high model utility.
📝 Abstract
Large Language Models (LLMs) have achieved remarkable performance and attracted significant research interest. Their enormous computational demands, however, hinder local deployment on resource-constrained devices. The prevalent LLM inference paradigm requires users to send queries to service providers for processing, which raises critical privacy concerns. Existing approaches allow users to obfuscate token embeddings before transmission and rely on local models for denoising. Nonetheless, transmitting token embeddings and deploying local models can incur excessive communication and computation overhead, preventing practical implementation. In this work, we propose **DEL**, a framework for **D**ifferentially private and communication-**E**fficient **L**LM split inference. More specifically, we propose an embedding projection module and a differentially private stochastic quantization mechanism to reduce the communication overhead in a privacy-preserving manner. To eliminate the need for local models, we adapt soft prompts at the server side to compensate for the utility degradation caused by privacy protection. To the best of our knowledge, this is the first work that uses soft prompts to improve the privacy-utility trade-off in LLM inference, and extensive experiments on text generation and natural language understanding benchmarks demonstrate the effectiveness of the proposed method.
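The abstract's client-side pipeline (project the token embedding to a lower dimension, then apply differentially private stochastic quantization before transmission) can be illustrated with a minimal sketch. This is not the paper's actual mechanism; the projection matrix, clipping bound `c`, Laplace noise calibration, and quantization grid below are all illustrative assumptions chosen to show how projection plus noisy randomized rounding shrinks what the client must send:

```python
import numpy as np

def project_embedding(x, proj):
    """Reduce the dimensionality of a token embedding with a fixed
    projection matrix (illustrative stand-in for DEL's projection module)."""
    return proj @ x

def dp_stochastic_quantize(x, eps, c=1.0, levels=16, rng=None):
    """Clip, perturb, and stochastically quantize a vector.

    Illustrative only: Laplace noise is scaled as 2*c/eps per coordinate,
    and randomized rounding maps each noisy value to one of `levels`
    discrete levels, so each coordinate fits in a few bits.
    """
    rng = rng or np.random.default_rng()
    x = np.clip(x, -c, c)                                  # bound sensitivity
    noisy = x + rng.laplace(scale=2.0 * c / eps, size=x.shape)
    noisy = np.clip(noisy, -c, c)
    grid = (noisy + c) / (2 * c) * (levels - 1)            # map to [0, levels-1]
    lower = np.floor(grid)
    q = lower + (rng.random(x.shape) < (grid - lower))     # unbiased rounding
    return q.astype(np.int8)

# Usage: a 64-dim embedding becomes 8 coordinates of 4 bits each.
rng = np.random.default_rng(0)
proj = rng.standard_normal((8, 64)) / np.sqrt(8)
embedding = rng.standard_normal(64)
quantized = dp_stochastic_quantize(project_embedding(embedding, proj),
                                   eps=4.0, rng=rng)
```

The communication saving comes from both steps: projection shrinks the number of coordinates, and quantization shrinks the bits per coordinate, while the noise provides the (here only sketched) differential privacy guarantee.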