🤖 AI Summary
This paper addresses the challenge of securely fine-tuning large language models (LLMs) on third-party cloud platforms in cross-entity settings, where both model and data privacy must be preserved. The authors propose a collaborative architecture that jointly protects model parameters and private training data. Methodologically, they introduce a novel hybrid approach integrating lightweight matrix obfuscation (using random transformations with low condition numbers) with trusted execution environments (TEEs, e.g., Intel SGX); only 5% of model parameters reside inside the TEE, drastically reducing computational overhead. Evaluated on GPT-2 variants across four NLP benchmark tasks, the method achieves fine-tuning accuracy comparable to local training, with significantly lower error than a naive obfuscation baseline. The key contribution is the first solution enabling *offshore* fine-tuning and inference with *dual confidentiality*: provable secrecy of both model parameters and private data, while maintaining high security, low resource cost, and full functional fidelity.
📝 Abstract
This work addresses the timely yet underexplored problem of performing inference and fine-tuning of a proprietary LLM owned by a model provider entity on the confidential/private data of another data owner entity, in a way that ensures the confidentiality of both the model and the data. Here, the fine-tuning is conducted offsite, i.e., on the computation infrastructure of a third-party cloud provider. We tackle this problem by proposing ObfuscaTune, a novel, efficient and fully utility-preserving approach that combines a simple yet effective obfuscation technique with an efficient use of confidential computing (only 5% of the model parameters are placed in the TEE). We empirically demonstrate the effectiveness of ObfuscaTune by validating it on GPT-2 models of different sizes on four NLP benchmark datasets. Finally, we compare against a naïve version of our approach to highlight the necessity of using random matrices with low condition numbers to reduce the errors induced by obfuscation.
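The role of the condition number can be illustrated with a minimal numerical sketch. This is an assumption-based toy, not the paper's actual scheme: it obfuscates a "weight" matrix by right-multiplying with a random matrix and recovers it with the inverse, comparing an arbitrary Gaussian matrix (whose condition number can be large) against an orthogonal matrix obtained via QR (condition number close to 1). In low precision, the round-trip error grows with the condition number of the obfuscation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
W = rng.standard_normal((n, n)).astype(np.float32)  # stand-in for model weights

# Naive choice: arbitrary random matrix, condition number can be large.
A = rng.standard_normal((n, n)).astype(np.float32)
# Low-condition choice: orthogonal matrix from QR, condition number ~ 1.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)).astype(np.float32))

def roundtrip_error(W, M):
    # Obfuscate (W @ M), then de-obfuscate with the inverse of M.
    W_rec = (W @ M) @ np.linalg.inv(M)
    return float(np.abs(W_rec - W).max())

print("cond(naive A):", np.linalg.cond(A))
print("cond(orthogonal Q):", np.linalg.cond(Q))
print("round-trip error, naive:", roundtrip_error(W, A))
print("round-trip error, low-cond:", roundtrip_error(W, Q))
```

On typical runs the orthogonal (low-condition-number) matrix yields a round-trip error several orders of magnitude smaller than the ill-conditioned Gaussian one, which is the effect the paper exploits to keep obfuscation fully utility-preserving.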