🤖 AI Summary
Model extraction attacks pose severe threats to intellectual property and user privacy in large language model (LLM) deployment. This paper systematically surveys three core threat categories targeting LLMs: functionality stealing, training data extraction, and prompt-targeted attacks such as prompt stealing. It categorizes prevailing attack paradigms—including knowledge distillation, query-response analysis, and parameter inference—and reviews corresponding defenses such as output perturbation, access control, and data de-identification. Methodologically, the work introduces a fine-grained attack-defense taxonomy tailored to generative models, along with dedicated evaluation metrics. The analysis reveals fundamental trade-offs among generalizability, practicality, and security in existing approaches. The study advocates integrating attack-path modeling with adaptive defense mechanisms to enhance robustness without compromising model utility. These contributions provide a rigorous theoretical framework and actionable guidelines for natural language processing, AI security, and real-world LLM deployment.
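To make the "knowledge distillation via query-response analysis" attack paradigm concrete, below is a minimal sketch of the data-collection stage of such an attack. The `query_victim_api` function is a hypothetical stand-in (not from the paper) for a black-box call to a deployed LLM endpoint; a real attacker would substitute an actual API client and then fine-tune a surrogate model on the collected pairs.

```python
def query_victim_api(prompt: str) -> str:
    # Hypothetical placeholder for a black-box LLM API call.
    # A real extraction attack would send `prompt` to a deployed
    # model endpoint and return the generated text.
    return "response to: " + prompt

def collect_distillation_corpus(prompts, query_budget):
    """Gather (prompt, response) pairs from the victim model.

    The attacker is limited by a query budget, which is one of the
    practicality constraints the survey's trade-off analysis highlights.
    """
    pairs = []
    for prompt in prompts[:query_budget]:
        response = query_victim_api(prompt)
        pairs.append((prompt, response))
    return pairs

# The surrogate ("student") model is then trained on `pairs` with a
# standard language-modeling objective to imitate the victim's behavior.
corpus = collect_distillation_corpus(
    ["Summarize this article.", "Translate to French."], query_budget=2
)
```

The sketch isolates why query budgets and response fidelity matter: the surrogate can only imitate what the attacker manages to observe through the API.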
📝 Abstract
Model extraction attacks pose significant security threats to deployed language models, potentially compromising intellectual property and user privacy. This survey provides a comprehensive taxonomy of LLM-specific extraction attacks and defenses, categorizing attacks into functionality extraction, training data extraction, and prompt-targeted attacks. We analyze various attack methodologies including API-based knowledge distillation, direct querying, parameter recovery, and prompt stealing techniques that exploit transformer architectures. We then examine defense mechanisms organized into model protection, data privacy protection, and prompt-targeted strategies, evaluating their effectiveness across different deployment scenarios. We propose specialized metrics for evaluating both attack effectiveness and defense performance, addressing the specific challenges of generative language models. Through our analysis, we identify critical limitations in current approaches and propose promising research directions, including integrated attack methodologies and adaptive defense mechanisms that balance security with model utility. This work serves NLP researchers, ML engineers, and security professionals seeking to protect language models in production environments.