Approximating Language Model Training Data from Weights

📅 2025-06-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of reconstructing training data from publicly released language model weights when the original training corpus remains confidential. It formally defines the "weights-to-data" inversion task and proposes a gradient-matching data retrieval method: using a differentiable gradient alignment metric, it efficiently identifies highly relevant subsets of large-scale public web corpora without any access to the original training data. The approach integrates weight-space projection, supervised fine-tuning (SFT), and classification-based evaluation. Empirically, it substantially improves downstream performance: AG News classification accuracy rises from 65% to 80%, approaching the expert baseline of 88%, and the perplexity of an SFT model on MSMARCO drops from 3.3 to 2.3, nearing the LLaMA expert model's 2.0. The work establishes a new paradigm for model interpretability, data provenance, and privacy analysis in foundation models.

📝 Abstract
Modern language models often have open weights but closed training data. We formalize the problem of data approximation from model weights and propose several baselines and metrics. We develop a gradient-based approach that selects the highest-matching data from a large public text corpus and show its effectiveness at recovering useful data given only the weights of the original and finetuned models. Even when none of the true training data is known, our method is able to locate a small subset of public Web documents that can be used to train a model to close to the original model's performance, for models trained with both classification and supervised finetuning. On the AG News classification task, our method improves performance from 65% (using randomly selected data) to 80%, approaching the expert benchmark of 88%. When applied to a model trained with SFT on MSMARCO web documents, our method reduces perplexity from 3.3 to 2.3, compared to an expert LLaMA model's perplexity of 2.0.
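The abstract does not spell out the selection criterion beyond "gradient-based". One common instantiation of this idea scores each candidate example by the cosine similarity between its descent direction (the negative loss gradient at the base weights) and the observed base-to-finetuned weight delta, then keeps the top-scoring examples. A minimal sketch on a toy logistic-regression model; the model choice, function names, and scoring details are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def grad_logreg(w, x, y):
    # Gradient of the logistic loss for a single (x, y) example, y in {0, 1}.
    p = 1.0 / (1.0 + np.exp(-x @ w))
    return (p - y) * x

def select_by_gradient_match(w_base, w_finetuned, candidates, k):
    """Rank candidate (x, y) pairs by cosine similarity between their
    descent direction at w_base and the finetuning weight update."""
    delta = w_finetuned - w_base          # direction the finetune moved the weights
    delta = delta / np.linalg.norm(delta)
    scores = []
    for x, y in candidates:
        g = -grad_logreg(w_base, x, y)    # negative gradient = descent direction
        g = g / (np.linalg.norm(g) + 1e-12)
        scores.append(float(g @ delta))
    order = np.argsort(scores)[::-1]      # highest alignment first
    return [int(i) for i in order[:k]], scores
```

With the base model at the origin and a finetuned weight vector pointing along the first feature, a candidate whose gradient pushes the weights the same way scores near +1, an orthogonal one near 0, and a mislabeled one near -1, so the top-k selection recovers the training-like examples.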
Problem

Research questions and friction points this paper is trying to address.

Approximating training data from language model weights
Recovering useful data using gradient-based selection
Improving model performance with approximated training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradient-based data selection from public corpus
Recovering useful data from model weights
Improving model performance with selected data