Extracting alignment data in open models

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work studies the extraction of alignment training data from post-trained open models. The approach is based on semantic similarity: a high-quality embedding model measures the distance between model outputs and the original supervised fine-tuning (SFT) or reinforcement learning (RL) training data, which identifies regurgitated samples that string matching would miss. The authors find that models readily reproduce data used in post-training, and argue that distillation can therefore be viewed as indirectly training on the original model's dataset, a possibly overlooked risk. Experiments show that training a base model on the extracted data recovers a meaningful amount of the original performance in areas such as long-context reasoning, safety, instruction following, and mathematical reasoning.

📝 Abstract
In this work, we show that it is possible to extract significant amounts of alignment training data from a post-trained model. This data is useful for steering the model to improve certain capabilities such as long-context reasoning, safety, instruction following, and maths. While the majority of related work on memorisation has focused on measuring the success of training data extraction through string matching, we argue that embedding models are better suited to our specific goals. Distances measured through a high-quality embedding model can identify semantic similarities between strings that a different metric, such as edit distance, will struggle to capture. In fact, in our investigation, approximate string matching would have severely undercounted (by a conservative estimate of $10\times$) the amount of data that can be extracted, due to trivial artifacts that deflate the metric. Interestingly, we find that models readily regurgitate training data that was used in post-training phases such as SFT or RL. We show that this data can then be used to train a base model, recovering a meaningful amount of the original performance. We believe our work exposes a possibly overlooked risk: that alignment data can be extracted. Finally, our work opens up an interesting discussion on the downstream effects of distillation practices: since models seem to be regurgitating aspects of their training set, distillation can be thought of as indirectly training on the model's original dataset.
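The contrast the abstract draws between approximate string matching and embedding distance can be illustrated with a small sketch. This is not the paper's pipeline: `toy_embed` is a hypothetical stand-in (a lowercase bag-of-words vector) for a real embedding model, the example strings are invented, and the function names are assumptions made for illustration. It only shows how trivial formatting artifacts can lower a character-level score while a content-level comparison is unaffected.

```python
import difflib
import math
import re
from collections import Counter


def string_similarity(a: str, b: str) -> float:
    """Approximate string matching: character-level ratio in [0, 1]."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model (assumption, not the
    paper's model): a bag-of-words count vector over lowercase tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[token] for token, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented example: the generation repeats the training sample but adds
# markdown emphasis and blank lines, i.e. formatting-only artifacts.
reference = "Solve for x: 2x + 3 = 7. Answer: x = 2"
generated = "**Solve for x:**\n\n2x + 3 = 7.\n\n**Answer:** x = 2"

string_score = string_similarity(reference, generated)
embed_score = cosine_similarity(toy_embed(reference), toy_embed(generated))

print(f"string similarity:        {string_score:.3f}")
print(f"toy embedding similarity: {embed_score:.3f}")
```

With a real embedding model the vectors would be dense and the comparison would also capture paraphrases, which this bag-of-words stand-in cannot; the sketch covers only the formatting-artifact case mentioned in the abstract.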
Problem

Research questions and friction points this paper is trying to address.

Extracting alignment training data from post-trained models
Using embedding models to identify semantic data similarities
Investigating risks of data regurgitation in distillation practices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using embedding models for semantic data extraction
Extracting alignment training data from post-trained models
Applying extracted data to train base models effectively