π€ AI Summary
This work exposes a novel data-extraction threat in fine-tuning open-source large language models (LLMs): model publishers can leverage black-box access to downstream fine-tuned models to reverse-engineer usersβ private fine-tuning data via backdoor-assisted training. To this end, we propose the first systematic black-box data extraction attack framework, integrating backdoor injection, gradient steganalysis, and adversarial detection and evasion techniques. Evaluated on open-source LLMs ranging from 3B to 32B parameters, our attack achieves perfect reconstruction rates of 76.3%β94.9% for fine-tuning query dataβup to 94.9% in the best case. This is the first demonstration that open-source LLM creators can cost-effectively and reliably exfiltrate downstream private data, thereby revealing a critical gap in fine-tuning security research. To foster community-driven defense development, we publicly release all code and datasets.
π Abstract
Fine-tuning on open-source Large Language Models (LLMs) with proprietary data is now a standard practice for downstream developers to obtain task-specific LLMs. Surprisingly, we reveal a new and concerning risk along with the practice: the creator of the open-source LLMs can later extract the private downstream fine-tuning data through simple backdoor training, only requiring black-box access to the fine-tuned downstream model. Our comprehensive experiments, across 4 popularly used open-source models with 3B to 32B parameters and 2 downstream datasets, suggest that the extraction performance can be strikingly high: in practical settings, as much as 76.3% downstream fine-tuning data (queries) out of a total 5,000 samples can be perfectly extracted, and the success rate can increase to 94.9% in more ideal settings. We also explore a detection-based defense strategy but find it can be bypassed with improved attack. Overall, we highlight the emergency of this newly identified data breaching risk in fine-tuning, and we hope that more follow-up research could push the progress of addressing this concerning risk. The code and data used in our experiments are released at https://github.com/thu-coai/Backdoor-Data-Extraction.