🤖 AI Summary
This paper addresses model provenance in black-box settings: given an original open-weight language model (Alice) and an unknown derivative model (Bob), the goal is to determine whether Bob's model is derived from Alice's—either in a query setting (with API access to Bob's model) or an observational setting (given only Bob's generated text). The authors formulate provenance detection as an independence test between training-data order and model outputs, leveraging the phenomenon of *palimpsestic memorization*: language models retain examples seen later in training more strongly. They propose prompt-based likelihood estimation, span-matching against training examples, and retraining the final phase of Alice's run on reshuffled data to enable cross-version likelihood comparison. Experiments achieve statistically significant detection (p-values of at most 10⁻⁸ in all but six cases) across over 40 fine-tuned Pythia and OLMo models in the query setting; in the observational setting, the retraining-based approach reliably identifies derivative text from as little as a few hundred tokens.
📝 Abstract
Suppose Alice trains an open-weight language model and Bob uses a black-box derivative of Alice's model to produce text. Can Alice prove that Bob is using her model, either by querying Bob's derivative model (query setting) or from the text alone (observational setting)? We formulate this question as an independence testing problem--in which the null hypothesis is that Bob's model or text is independent of Alice's randomized training run--and investigate it through the lens of palimpsestic memorization in language models: models are more likely to memorize data seen later in training, so we can test whether Bob is using Alice's model using test statistics that capture correlation between Bob's model or text and the ordering of training examples in Alice's training run. If Alice has randomly shuffled her training data, then any significant correlation amounts to exactly quantifiable statistical evidence against the null hypothesis, regardless of the composition of Alice's training data. In the query setting, we directly estimate (via prompting) the likelihood Bob's model gives to Alice's training examples and their order; we correlate the likelihoods of over 40 fine-tunes of various Pythia and OLMo base models ranging from 1B to 12B parameters with the base model's training data order, achieving a p-value on the order of at most 1e-8 in all but six cases. In the observational setting, we try two approaches based on estimating 1) the likelihood of Bob's text overlapping with spans of Alice's training examples and 2) the likelihood of Bob's text with respect to different versions of Alice's model we obtain by repeating the last phase (e.g., 1%) of her training run on reshuffled data. The second approach can reliably distinguish Bob's text from as little as a few hundred tokens; the first does not involve any retraining but requires many more tokens (several hundred thousand) to achieve high power.
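The independence test in the query setting can be sketched as a permutation test: correlate Bob's per-example likelihoods with Alice's (randomly shuffled) training order, then compare against correlations under random re-permutations of that order. The sketch below is illustrative, not the paper's exact statistic; the function name, Spearman correlation choice, and permutation count are assumptions, and the per-example log-likelihoods are assumed to have already been estimated via prompting.

```python
import numpy as np

def provenance_pvalue(log_likelihoods, train_order, n_perm=999, seed=0):
    """Permutation test for dependence between Bob's per-example
    log-likelihoods and the order of Alice's training examples.

    Under the null hypothesis (Bob's model is independent of Alice's
    randomized training run), the shuffled training order is exchangeable,
    so any correlation with the likelihoods should be near zero.
    """
    rng = np.random.default_rng(seed)
    ll = np.asarray(log_likelihoods, dtype=float)
    order = np.asarray(train_order, dtype=float)

    def spearman(a, b):
        # Spearman rank correlation via rank transform + Pearson correlation.
        ra = np.argsort(np.argsort(a))
        rb = np.argsort(np.argsort(b))
        return np.corrcoef(ra, rb)[0, 1]

    observed = spearman(ll, order)
    # One-sided test: palimpsestic memorization predicts higher likelihood
    # for later training examples, i.e. a positive correlation.
    exceed = sum(
        spearman(ll, rng.permutation(order)) >= observed
        for _ in range(n_perm)
    )
    # Add-one correction keeps the p-value valid and strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)
```

With only 999 permutations the smallest attainable p-value is 1/1000; the extreme significance levels reported in the paper (on the order of 1e-8) instead come from analytic tail bounds on the test statistic, which this sketch does not implement.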