🤖 AI Summary
This paper introduces a statistical framework for testing whether two language models were trained from independent random initializations, given only their weights. It covers two settings: a constrained setting, where assumptions are made about architecture and training procedure, and an unconstrained setting, where architectures may differ and adversarial evasion is possible. The constrained tests simulate exchangeable copies of each model and compare weight and activation similarity between the original pair and these copies, yielding exact p-values under the null hypothesis of independent initialization. The unconstrained test instead matches hidden activations between the two models; it is robust to adversarial transformations and architecture changes and can localize the specific components that are not independent, though it no longer yields exact p-values. Evaluated on all 210 pairs formed from 21 open-weight models, the constrained tests correctly identify every non-independent pair. The unconstrained test further identifies how Llama 3.1-8B was pruned to initialize Llama 3.2-3B and uncovers shared layers between Mistral-7B and StripedHyena-7B.
📝 Abstract
We consider the following problem: given the weights of two models, can we test whether they were trained independently, i.e., from independent random initializations? We consider two settings: constrained and unconstrained. In the constrained setting, we make assumptions about model architecture and training and propose a family of statistical tests that yield exact p-values with respect to the null hypothesis that the models are trained from independent random initializations. These p-values are valid regardless of the composition of either model's training data; we compute them by simulating exchangeable copies of each model under our assumptions and comparing various similarity measures of weights and activations between the original two models versus these copies. We report the p-values from these tests on all pairs of 21 open-weight models (210 pairs in total) and correctly identify all pairs of non-independent models. Our tests remain effective even if one model was fine-tuned for many tokens. In the unconstrained setting, where we make no assumptions about training procedures, allow the model architecture to change, and allow for adversarial evasion attacks, the previous tests no longer work. Instead, we propose a new test that matches hidden activations between two models and is robust to adversarial transformations and to changes in model architecture. The test can also perform localized testing, identifying specific non-independent components of models. Though this test no longer yields exact p-values, we find empirically that its output behaves like a p-value and reliably identifies non-independent models. Notably, we can use the test to identify specific parts of one model that are derived from another (e.g., how Llama 3.1-8B was pruned to initialize Llama 3.2-3B, or shared layers between Mistral-7B and StripedHyena-7B), and it is even robust to retraining individual layers of either model from scratch.
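The resampling idea behind the constrained test can be illustrated with a toy permutation test. The sketch below is not the authors' implementation. It assumes a single weight matrix per model, treats random reorderings of hidden units (rows) as the exchangeable copies, and uses cosine similarity as the only statistic. All function names are hypothetical.

```python
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equally shaped matrices, flattened."""
    dot = sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    norm_a = sum(x * x for row in a for x in row) ** 0.5
    norm_b = sum(x * x for row in b for x in row) ** 0.5
    return dot / (norm_a * norm_b)


def permute_hidden_units(weights, perm):
    """Reorder rows (hidden units); assumed here to give an exchangeable copy."""
    return [weights[i] for i in perm]


def permutation_p_value(w_a, w_b, num_copies=99, seed=0):
    """P-value for the null that w_a and w_b come from independent inits.

    Under the null (and the exchangeability assumption), the observed
    statistic is exchangeable with the statistics of the permuted copies,
    so its rank among them gives a valid p-value.
    """
    rng = random.Random(seed)
    observed = cosine_similarity(w_a, w_b)
    at_least_as_large = 0
    for _ in range(num_copies):
        perm = list(range(len(w_a)))
        rng.shuffle(perm)
        copy_stat = cosine_similarity(permute_hidden_units(w_a, perm), w_b)
        if copy_stat >= observed:
            at_least_as_large += 1
    # The +1 terms count the observed statistic itself.
    return (1 + at_least_as_large) / (num_copies + 1)


def random_matrix(rng, rows, cols):
    """Gaussian random matrix standing in for a randomly initialized layer."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(1)
base = random_matrix(rng, 32, 16)
# A lightly perturbed copy stands in for a fine-tuned (non-independent) model.
fine_tuned = [[x + 0.1 * rng.gauss(0.0, 1.0) for x in row] for row in base]
# A fresh draw stands in for an independently initialized model.
independent = random_matrix(rng, 32, 16)

p_dependent = permutation_p_value(base, fine_tuned)
p_independent = permutation_p_value(base, independent)
print(p_dependent, p_independent)
```

With 99 copies, the smallest attainable p-value is 1/100. The perturbed copy is expected to reach that floor because no random reordering of the rows matches it as well as the original order does, whereas the independent draw has no preferred ordering and its p-value can land anywhere in (0, 1].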