"Yes, My LoRD."Guiding Language Model Extraction with Locality Reinforced Distillation

📅 2024-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing model extraction attacks (MEAs) on large language models (LLMs) are largely adapted from DNN settings, overlooking the fundamental misalignment between their optimization objectives and those of LLMs under alignment training, which results in low fidelity, high query overhead, and vulnerability to watermarking defenses. This paper proposes LoRD (Locality Reinforced Distillation), the first LLM-tailored MEA framework, which reformulates extraction as an alignment-aware policy-gradient task; the authors theoretically prove that its optimization trajectory is consistent with LLM alignment dynamics. LoRD introduces response-driven preference construction, exploration-guided querying, and locality-constrained reinforcement distillation to substantially reduce query complexity and mitigate watermark interference. Extensive experiments across multiple state-of-the-art commercial LLMs demonstrate that LoRD significantly outperforms prior methods in fidelity, query efficiency, and robustness against watermarking defenses.

📝 Abstract
Model extraction attacks (MEAs) on large language models (LLMs) have received increasing attention in recent research. However, existing attack methods typically adapt the extraction strategies originally developed for deep neural networks (DNNs). They neglect the underlying inconsistency between the training tasks of MEA and LLM alignment, leading to suboptimal attack performance. To tackle this issue, we propose Locality Reinforced Distillation (LoRD), a novel model extraction algorithm specifically designed for LLMs. In particular, LoRD employs a newly defined policy-gradient-style training task that utilizes the responses of the victim model as the signal to guide the crafting of preferences for the local model. Theoretical analyses demonstrate that I) the convergence procedure of LoRD in model extraction is consistent with the alignment procedure of LLMs, and II) LoRD can reduce query complexity while mitigating watermark protection through our exploration-based stealing. Extensive experiments validate the superiority of our method in extracting various state-of-the-art commercial LLMs. Our code is available at: https://github.com/liangzid/LoRD-MEA.
Problem

Research questions and friction points this paper is trying to address.

Addresses suboptimal performance in model extraction attacks on LLMs.
Introduces Locality Reinforced Distillation for enhanced extraction efficiency.
Reduces query complexity and mitigates watermark protection in attacks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Locality Reinforced Distillation algorithm
Policy-gradient-style training task
Reduced query complexity technique
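To make the "policy-gradient-style training task" concrete, here is a minimal toy sketch of the general idea: the victim model's response acts as the positive preference signal, a sampled alternative acts as the negative, and a KL-style pull toward the initial local model keeps updates local. Everything here (the 4-token vocabulary, reward shape, learning rate, and the specific locality penalty) is an illustrative assumption, not the paper's actual LoRD algorithm.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

random.seed(0)
VOCAB = 4                          # toy vocabulary size (illustrative)
logits = [0.0] * VOCAB             # toy local model: one softmax over tokens
init_probs = softmax(logits)       # snapshot for the locality penalty

victim_token = 2                   # token the victim model returned (positive signal)
lr, kl_weight = 0.5, 0.1           # hyperparameters chosen arbitrarily for the demo

for _ in range(50):
    probs = softmax(logits)
    # Sample a dispreferred token to push down, REINFORCE-style.
    neg_token = random.choice([t for t in range(VOCAB) if t != victim_token])
    for t in range(VOCAB):
        # d log pi(a) / d logit_t = 1{t == a} - p_t for a softmax policy;
        # ascend log-prob of the victim token, descend the negative one.
        pos_grad = (1.0 if t == victim_token else 0.0) - probs[t]
        neg_grad = (1.0 if t == neg_token else 0.0) - probs[t]
        # Locality term: pull the policy back toward the initial local model.
        kl_pull = kl_weight * (probs[t] - init_probs[t])
        logits[t] += lr * (pos_grad - neg_grad) - lr * kl_pull

final = softmax(logits)
# After training, the local policy concentrates on the victim-preferred token.
print(final[victim_token] > final[0])
```

The point of the sketch is the shape of the update, not the numbers: preference comes only from victim responses (no ground-truth labels), and the locality term bounds how far each step moves the local model.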
Zi Liang
Hong Kong Polytechnic University
Natural Language Processing · AI Security
Qingqing Ye
Assistant Professor, The Hong Kong Polytechnic University
Data Privacy and Security · Adversarial Machine Learning
Yanyun Wang
MPhil Student, The Hong Kong University of Science and Technology (Guangzhou)
Adversarial Robustness · AI Security
Sen Zhang
The Hong Kong Polytechnic University
Yaxin Xiao
The Hong Kong Polytechnic University
Ronghua Li
The Hong Kong Polytechnic University
Jianliang Xu
Hong Kong Baptist University
Haibo Hu
The Hong Kong Polytechnic University