Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!

📅 2025-05-21

📈 Citations: 0

✨ Influential: 0

career value

212K/year

🤖 AI Summary

This work exposes a novel data-extraction threat in fine-tuning open-source large language models (LLMs): model publishers can leverage black-box access to downstream fine-tuned models to reverse-engineer users’ private fine-tuning data via backdoor-assisted training. To this end, we propose the first systematic black-box data extraction attack framework, integrating backdoor injection, gradient steganalysis, and adversarial detection and evasion techniques. Evaluated on open-source LLMs ranging from 3B to 32B parameters, our attack achieves perfect reconstruction rates of 76.3%–94.9% for fine-tuning query data—up to 94.9% in the best case. This is the first demonstration that open-source LLM creators can cost-effectively and reliably exfiltrate downstream private data, thereby revealing a critical gap in fine-tuning security research. To foster community-driven defense development, we publicly release all code and datasets.

Technology Category

Application Category

📝 Abstract

Fine-tuning on open-source Large Language Models (LLMs) with proprietary data is now a standard practice for downstream developers to obtain task-specific LLMs. Surprisingly, we reveal a new and concerning risk along with the practice: the creator of the open-source LLMs can later extract the private downstream fine-tuning data through simple backdoor training, only requiring black-box access to the fine-tuned downstream model. Our comprehensive experiments, across 4 popularly used open-source models with 3B to 32B parameters and 2 downstream datasets, suggest that the extraction performance can be strikingly high: in practical settings, as much as 76.3% downstream fine-tuning data (queries) out of a total 5,000 samples can be perfectly extracted, and the success rate can increase to 94.9% in more ideal settings. We also explore a detection-based defense strategy but find it can be bypassed with improved attack. Overall, we highlight the emergency of this newly identified data breaching risk in fine-tuning, and we hope that more follow-up research could push the progress of addressing this concerning risk. The code and data used in our experiments are released at https://github.com/thu-coai/Backdoor-Data-Extraction.

Problem

Research questions and friction points this paper is trying to address.

Risk of private fine-tuning data extraction from open-source LLMs

Backdoor training enables high extraction success rates

Current detection defenses are ineffective against improved attacks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Backdoor training extracts private fine-tuning data

Black-box access enables high extraction success

Detection-based defense can be bypassed

🔎 Similar Papers

A Survey of Recent Backdoor Attacks and Defenses in Large Language Models