$δ$-STEAL: LLM Stealing Attack with Local Differential Privacy

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the vulnerability of watermarking mechanisms for protecting large language model (LLM) intellectual property. The authors propose δ-STEAL, a black-box model stealing attack that incorporates local differential privacy (LDP) into the model stealing setting: fine-grained LDP noise is injected into token embeddings during fine-tuning, obfuscating watermark signals and evading the watermark detectors deployed by service providers. Experiments show that δ-STEAL achieves attack success rates of up to 96.95% while largely preserving the utility of the stolen model. The work exposes a security blind spot in existing LLM watermarking schemes under privacy-aware adversarial settings, underscoring their fragility against stealthy, noise-based evasion, and offers both a cautionary insight and a new direction for designing and rigorously evaluating robust watermarking frameworks for LLM IP protection.

📝 Abstract
Large language models (LLMs) demonstrate remarkable capabilities across various tasks. However, their deployment introduces significant risks related to intellectual property. In this context, we focus on model stealing attacks, where adversaries replicate the behaviors of these models to steal services. These attacks are highly relevant to proprietary LLMs and pose serious threats to revenue and financial stability. To mitigate these risks, the watermarking solution embeds imperceptible patterns in LLM outputs, enabling model traceability and intellectual property verification. In this paper, we study the vulnerability of LLM service providers by introducing $δ$-STEAL, a novel model stealing attack that bypasses the service provider's watermark detectors while preserving the adversary's model utility. $δ$-STEAL injects noise into the token embeddings of the adversary's model during fine-tuning in a way that satisfies local differential privacy (LDP) guarantees. The adversary queries the service provider's model to collect outputs and form input-output training pairs. By applying LDP-preserving noise to these pairs, $δ$-STEAL obfuscates watermark signals, making it difficult for the service provider to determine whether its outputs were used, thereby preventing claims of model theft. Our experiments show that $δ$-STEAL with lightweight modifications achieves attack success rates of up to $96.95\%$ without significantly compromising the adversary's model utility. The noise scale in LDP controls the trade-off between attack effectiveness and model utility. This poses a significant risk, as even robust watermarks can be bypassed, allowing adversaries to deceive watermark detectors and undermine current intellectual property protection methods.
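The abstract's core mechanism, perturbing token embeddings with noise calibrated to satisfy an ε-LDP guarantee, can be sketched with the standard Laplace mechanism. This is a minimal illustration, not the paper's implementation: the function name, the L1-clipping bound, and the choice of Laplace noise are all assumptions made for the sketch; the paper's exact noise distribution and calibration may differ.

```python
import numpy as np

def ldp_perturb_embedding(embedding, epsilon, clip_norm=1.0, rng=None):
    """Perturb one token embedding via the Laplace mechanism (illustrative).

    Clipping to an L1 ball of radius clip_norm bounds the L1 distance
    between any two clipped vectors by 2 * clip_norm, so Laplace noise
    with scale b = 2 * clip_norm / epsilon satisfies epsilon-LDP.
    """
    rng = np.random.default_rng() if rng is None else rng
    emb = np.asarray(embedding, dtype=float)
    l1 = np.abs(emb).sum()
    if l1 > clip_norm:
        emb = emb * (clip_norm / l1)       # project onto the L1 ball
    scale = 2.0 * clip_norm / epsilon      # Laplace scale = sensitivity / epsilon
    return emb + rng.laplace(loc=0.0, scale=scale, size=emb.shape)

# Smaller epsilon means larger noise: stronger watermark obfuscation,
# at the cost of the fine-tuned model's utility.
noisy = ldp_perturb_embedding(np.ones(8) * 0.1, epsilon=2.0)
```

In the attack setting described above, such a perturbation would be applied to the token embeddings of the query-response pairs used for fine-tuning, so that residual watermark patterns in the provider's outputs are masked by calibrated randomness.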
Problem

Research questions and friction points this paper is trying to address.

Bypassing watermark detectors in LLM model stealing attacks
Preserving adversary model utility under local differential privacy
Obfuscating watermark signals to prevent intellectual property claims
Innovation

Methods, ideas, or system contributions that make the work stand out.

Injecting LDP-calibrated noise into token embeddings to obfuscate watermark signals
Using local differential privacy guarantees to bypass watermark detectors
Fine-tuning the adversary's model on noised input-output pairs while preserving utility
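The abstract notes that the LDP noise scale controls the trade-off between attack effectiveness and model utility. The numerical sketch below illustrates the generic mechanism behind that trade-off (expected Laplace distortion grows as 1/ε); the sensitivity value is an assumption, and these are not the paper's measurements.

```python
import numpy as np

# For Laplace noise with scale b, the expected absolute perturbation per
# coordinate is exactly b, and b = sensitivity / epsilon. So halving
# epsilon doubles the distortion injected into each embedding coordinate.
sensitivity = 2.0  # assumed L1 sensitivity after clipping
for eps in [0.5, 1.0, 2.0, 4.0]:
    b = sensitivity / eps
    rng = np.random.default_rng(0)
    noise = rng.laplace(scale=b, size=100_000)
    print(f"epsilon={eps:>3}: mean |noise| ~ {np.abs(noise).mean():.3f} (theory {b:.3f})")
```

Lower ε gives a stronger privacy (and obfuscation) guarantee but larger distortion, which is the lever the attack tunes between evading the watermark detector and keeping the stolen model useful.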