🤖 AI Summary
This paper investigates the identifiability of linear structures—such as morphological vector parallelism—in language models: if a model exhibits a specific linear property, must all models with identical next-token prediction distributions necessarily share that property?
Method: Under a weakened diversity assumption, the authors provide the first complete characterization of distributionally equivalent predictors.
Contribution/Results: They prove that linear properties exhibit “all-or-nothing” identifiability: under reasonable conditions, if one model in a distributional equivalence class satisfies a given linear relation, then all of its distributionally equivalent models satisfy it, or none do. This result unifies diverse geometric linear phenomena observed in neural language models and establishes the first rigorous, distribution-equivalence-based theoretical foundation for neural-symbolic alignment and model interpretability.
📝 Abstract
We analyze identifiability as a possible explanation for the ubiquity of linear properties across language models, such as the vector difference between the representations of "easy" and "easiest" being parallel to that between "lucky" and "luckiest". For this, we ask whether finding a linear property in one model implies that any model that induces the same distribution has that property, too. To answer this, we first prove an identifiability result to characterize distribution-equivalent next-token predictors, lifting a diversity requirement of previous results. Second, based on a refinement of relational linearity [Paccanaro and Hinton, 2001; Hernandez et al., 2024], we show how many notions of linearity are amenable to our analysis. Finally, we show that under suitable conditions, these linear properties either hold in all distribution-equivalent next-token predictors or in none of them.
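The parallelism property described above can be illustrated with a minimal sketch. The toy embeddings and the shared `-est` offset below are assumptions for illustration, not the paper's construction: in an idealized model, applying the same morphological relation adds a fixed offset vector, so the difference vectors for two word pairs are parallel (cosine similarity 1).

```python
import numpy as np

# Hypothetical toy embeddings: assume the morphological shift "-est"
# corresponds to adding one shared offset vector (idealized setting).
rng = np.random.default_rng(0)
offset = rng.normal(size=8)   # shared "-est" direction (assumption)
easy = rng.normal(size=8)
lucky = rng.normal(size=8)
easiest = easy + offset
luckiest = lucky + offset

def cosine(u, v):
    """Cosine similarity; 1.0 means the two vectors are parallel."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# The linear property: difference vectors of the same relation are parallel.
parallelism = cosine(easiest - easy, luckiest - lucky)
print(round(parallelism, 6))  # → 1.0 by construction
```

In a real model the cosine would be measured empirically and be close to, rather than exactly, 1; the paper's question is whether such a measured property must transfer to every model inducing the same next-token distribution.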