A General Pseudonymization Framework for Cloud-Based LLMs: Replacing Privacy Information in Controlled Text Generation

📅 2025-02-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In cloud-based large language model (LLM) remote inference, user prompts frequently contain sensitive private entities, creating serious privacy-leakage risks. Existing pretraining- and fine-tuning-stage approaches do not address this setting, while inference-time methods are constrained by the need to retain privacy-sensitive information. Method: This paper proposes the first general-purpose pseudonymization framework tailored for cloud LLM remote invocation. It performs real-time privacy entity identification on the client side, integrating rule-based heuristics with lightweight named entity recognition, and applies context-aware, semantics-preserving substitution alongside dynamic token-level rewriting to achieve controllable and reversible anonymization. Contribution/Results: Evaluated across multiple benchmarks, the framework reduces privacy leakage by over 92% while degrading generation accuracy by less than 3%, significantly outperforming baselines. The implementation is publicly open-sourced.
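The client-side workflow described above (detect private entities before the prompt leaves the device, substitute reversible placeholders, then map the cloud model's response back) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the regex rule set stands in for the paper's combination of rule-based heuristics and lightweight NER, and the placeholder format is an assumption.

```python
import re

def pseudonymize(prompt, entity_patterns):
    """Replace detected private entities with reversible placeholders."""
    mapping = {}   # placeholder -> original value
    counter = {}   # per-label placeholder counter

    def substitute(match, label):
        value = match.group(0)
        # Reuse the same placeholder for repeated mentions of one entity
        for placeholder, original in mapping.items():
            if original == value:
                return placeholder
        counter[label] = counter.get(label, 0) + 1
        placeholder = f"<{label}_{counter[label]}>"
        mapping[placeholder] = value
        return placeholder

    for label, pattern in entity_patterns.items():
        prompt = re.sub(pattern, lambda m, l=label: substitute(m, l), prompt)
    return prompt, mapping

def restore(response, mapping):
    """Map placeholders in the cloud LLM's response back to the originals."""
    for placeholder, original in mapping.items():
        response = response.replace(placeholder, original)
    return response

# Hypothetical rule set; a real client would add NER-based detection
patterns = {
    "EMAIL": r"[\w.]+@[\w.]+\.\w+",
    "NAME": r"\bAlice Zhang\b",
}
safe_prompt, mapping = pseudonymize("Email Alice Zhang at alice@corp.com.", patterns)
# safe_prompt can now be sent to the cloud LLM; the mapping never leaves the client
restored = restore(safe_prompt, mapping)
```

The key property is reversibility: the entity-to-placeholder mapping stays on the client, so the provider only ever sees pseudonyms, yet the user receives a response with the original entities restored.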

📝 Abstract
An increasing number of companies have begun providing services that leverage cloud-based large language models (LLMs), such as ChatGPT. However, this development raises substantial privacy concerns, as users' prompts are transmitted to and processed by the model providers. Among the various privacy protection methods for LLMs, those implemented during the pre-training and fine-tuning phases fail to mitigate the privacy risks associated with the remote use of cloud-based LLMs by users. On the other hand, methods applied during the inference phase are primarily effective in scenarios where the LLM's inference does not rely on privacy-sensitive information. In this paper, we outline the process of remote user interaction with LLMs and, for the first time, propose a detailed definition of a general pseudonymization framework applicable to cloud-based LLMs. The experimental results demonstrate that the proposed framework strikes an optimal balance between privacy protection and utility. The code for our method is available to the public at https://github.com/Mebymeby/Pseudonymization-Framework.
Problem

Research questions and friction points this paper is trying to address.

Addressing privacy in cloud-based LLMs
Proposing pseudonymization for user data
Balancing privacy protection and utility
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cloud-based LLMs pseudonymization framework
Balances privacy protection and utility
Publicly available pseudonymization method code
Shilong Hou
College of Application and Technology, Shenzhen University, China
Ruilin Shang
College of Application and Technology, Shenzhen University, China
Zi Long
College of Big Data and Internet, Shenzhen Technology University, China
Xianghua Fu
Shenzhen Technology University
Machine Learning · Natural Language Processing
Yin Chen
Lecturer in Mathematics at University of Saskatchewan
Invariant theory · Lie theory · Commutative algebra · Applied algebraic geometry