ObfuscaTune: Obfuscated Offsite Fine-tuning and Inference of Proprietary LLMs on Private Datasets

📅 2024-07-03
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of securely fine-tuning large language models (LLMs) on third-party cloud infrastructure in cross-entity settings, where both model and data confidentiality must be preserved. The authors propose a collaborative architecture that jointly protects model parameters and private training data. Methodologically, they introduce a hybrid approach integrating lightweight matrix obfuscation—built on random transformation matrices with low condition numbers—with trusted execution environments (TEEs, e.g., Intel SGX); only 5% of the model parameters reside inside the TEE, drastically reducing computational overhead. Evaluated on GPT-2 variants across four NLP benchmark tasks, the method achieves fine-tuning accuracy comparable to local training, with significantly lower obfuscation-induced error than a naive baseline. The key contribution is the first solution enabling *offsite* fine-tuning and inference with *dual confidentiality*: secrecy of both the model parameters and the private data, while maintaining high security, low resource cost, and full utility preservation.

📝 Abstract
This work addresses the timely yet underexplored problem of performing inference and finetuning of a proprietary LLM owned by a model provider entity on the confidential/private data of another data owner entity, in a way that ensures the confidentiality of both the model and the data. Hereby, the finetuning is conducted offsite, i.e., on the computation infrastructure of a third-party cloud provider. We tackle this problem by proposing ObfuscaTune, a novel, efficient and fully utility-preserving approach that combines a simple yet effective obfuscation technique with an efficient usage of confidential computing (only 5% of the model parameters are placed on TEE). We empirically demonstrate the effectiveness of ObfuscaTune by validating it on GPT-2 models with different sizes on four NLP benchmark datasets. Finally, we compare to a naïve version of our approach to highlight the necessity of using random matrices with low condition numbers in our approach to reduce errors induced by the obfuscation.
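Why low condition numbers matter can be sketched with a toy example. This is not the authors' code: the shapes, the seed, and the use of orthogonal matrices (whose condition number is exactly 1) as the well-conditioned transformations are illustrative assumptions. The idea is that a weight matrix W is shipped offsite as A @ W @ B, the offsite side computes on obfuscated operands only, and the trusted side de-obfuscates the result; with poorly conditioned A and B, floating-point round-off in the recovery step grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n):
    # QR of a Gaussian matrix gives an orthogonal Q: condition number exactly 1
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q.astype(np.float32)

d, k, n = 64, 32, 8
W = rng.standard_normal((d, k)).astype(np.float32)  # proprietary weight matrix
x = rng.standard_normal((n, d)).astype(np.float32)  # private input batch

# Obfuscation: only A @ W @ B leaves the trusted side; A and B stay secret
A, B = random_orthogonal(d), random_orthogonal(k)
W_obf = A @ W @ B
x_obf = x @ A.T            # for orthogonal A, the inverse is just A.T
y_obf = x_obf @ W_obf      # computed on the untrusted infrastructure
y = y_obf @ B.T            # de-obfuscated: equals x @ W up to round-off
err = np.abs(y - x @ W).max()

# Naive baseline: generic Gaussian obfuscation matrices are typically far
# worse conditioned, so float32 error after explicit inversion grows
A2 = rng.standard_normal((d, d)).astype(np.float32)
B2 = rng.standard_normal((k, k)).astype(np.float32)
y_naive = (x @ np.linalg.inv(A2)) @ (A2 @ W @ B2) @ np.linalg.inv(B2)
err_naive = np.abs(y_naive - x @ W).max()
```

With orthogonal matrices, `err` stays near float32 round-off; the Gaussian baseline's error scales with the condition numbers of `A2` and `B2`, which mirrors the paper's argument for preferring low-condition-number random matrices.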
Problem

Research questions and friction points this paper is trying to address.

Federated Learning
Privacy Preservation
Language Model Improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

ObfuscaTune
Privacy-Preserving Tuning
Confidentiality Enhancement