"Yes, My LoRD."Guiding Language Model Extraction with Locality Reinforced Distillation

📅 2024-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing model extraction attacks (MEAs) on large language models (LLMs) are largely adapted from DNN settings, overlooking the fundamental misalignment between their optimization objectives and those of LLMs under alignment training, which results in low fidelity, high query overhead, and vulnerability to watermarking defenses. This paper proposes LoRD (Locality Reinforced Distillation), the first LLM-tailored MEA framework, which reformulates extraction as an alignment-aware policy-gradient task; the authors theoretically prove that its optimization trajectory is consistent with LLM alignment dynamics. LoRD introduces response-driven preference construction, exploration-guided querying, and locality-constrained reinforcement distillation to substantially reduce query complexity and mitigate watermark interference. Extensive experiments across multiple state-of-the-art commercial LLMs demonstrate that LoRD significantly outperforms prior methods in fidelity, query efficiency, and robustness against watermarking defenses.

📝 Abstract
Model extraction attacks (MEAs) on large language models (LLMs) have received increasing attention in recent research. However, existing attack methods typically adapt the extraction strategies originally developed for deep neural networks (DNNs). They neglect the underlying inconsistency between the training tasks of MEA and LLM alignment, leading to suboptimal attack performance. To tackle this issue, we propose Locality Reinforced Distillation (LoRD), a novel model extraction algorithm specifically designed for LLMs. In particular, LoRD employs a newly defined policy-gradient-style training task that utilizes the responses of the victim model as the signal to guide the crafting of preferences for the local model. Theoretical analyses demonstrate that I) the convergence procedure of LoRD in model extraction is consistent with the alignment procedure of LLMs, and II) LoRD can reduce query complexity while mitigating watermark protection through our exploration-based stealing. Extensive experiments validate the superiority of our method in extracting various state-of-the-art commercial LLMs. Our code is available at: https://github.com/liangzid/LoRD-MEA.
Problem

Research questions and friction points this paper is trying to address.

Addresses suboptimal performance in model extraction attacks on LLMs.
Introduces Locality Reinforced Distillation for enhanced extraction efficiency.
Reduces query complexity and mitigates watermark protection in attacks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Locality Reinforced Distillation algorithm
Policy-gradient-style training task
Reduced query complexity technique
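To make the "policy-gradient-style training task" concrete, here is a minimal toy sketch of the general idea: the victim model's response acts as the positive preference signal, a sampled alternative acts as the negative, and a KL-style pull toward the initial local model keeps updates local. Everything here (the 4-token vocabulary, reward shape, learning rate, and the specific locality penalty) is an illustrative assumption, not the paper's actual LoRD algorithm.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

random.seed(0)
VOCAB = 4                          # toy vocabulary size (illustrative)
logits = [0.0] * VOCAB             # toy local model: one softmax over tokens
init_probs = softmax(logits)       # snapshot for the locality penalty

victim_token = 2                   # token the victim model returned (positive signal)
lr, kl_weight = 0.5, 0.1           # hyperparameters chosen arbitrarily for the demo

for _ in range(50):
    probs = softmax(logits)
    # Sample a dispreferred token to push down, REINFORCE-style.
    neg_token = random.choice([t for t in range(VOCAB) if t != victim_token])
    for t in range(VOCAB):
        # d log pi(a) / d logit_t = 1{t == a} - p_t for a softmax policy;
        # ascend log-prob of the victim token, descend the negative one.
        pos_grad = (1.0 if t == victim_token else 0.0) - probs[t]
        neg_grad = (1.0 if t == neg_token else 0.0) - probs[t]
        # Locality term: pull the policy back toward the initial local model.
        kl_pull = kl_weight * (probs[t] - init_probs[t])
        logits[t] += lr * (pos_grad - neg_grad) - lr * kl_pull

final = softmax(logits)
# After training, the local policy concentrates on the victim-preferred token.
print(final[victim_token] > final[0])
```

The point of the sketch is the shape of the update, not the numbers: preference comes only from victim responses (no ground-truth labels), and the locality term bounds how far each step moves the local model.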
Zi Liang
Hong Kong Polytechnic University
Natural Language Processing · AI Security
Qingqing Ye
Assistant Professor, The Hong Kong Polytechnic University
Data Privacy and Security · Adversarial Machine Learning
Yanyun Wang
MPhil Student, The Hong Kong University of Science and Technology (Guangzhou)
Adversarial Robustness · AI Security
Sen Zhang
The Hong Kong Polytechnic University
Yaxin Xiao
The Hong Kong Polytechnic University
Ronghua Li
The Hong Kong Polytechnic University
Jianliang Xu
Hong Kong Baptist University
Haibo Hu
The Hong Kong Polytechnic University