On Linear Representations and Pretraining Data Frequency in Language Models

📅 2025-04-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates how subject-object co-occurrence frequency in pretraining data affects language models' (LMs') ability to linearly represent factual relations. Using linear probing, pretraining-corpus frequency statistics, and regression on cross-model representation quality, the authors analyze this relationship. Their key finding is a frequency threshold beyond which linear representational strength for factual relations sharply increases: roughly 1k subject-object co-occurrences in OLMo-7B and 2k in GPT-J, regardless of when those occurrences happen during pretraining. Building on this, they train a regression model that infers pretraining co-occurrence frequencies from measurements of linear representation quality, achieving low error even on a different model trained on a different corpus, which enables data auditing without access to the training data. The core contributions are: (1) identifying a nonlinear threshold in the frequency–representation-strength relationship, and (2) introducing a cross-model, cross-dataset interpretable tool for estimating training data frequencies.
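The regression step summarized above can be sketched roughly as follows. This is a minimal illustration, not the paper's actual implementation: the probe quality scores, the synthetic data, and the `fit_ols`/`estimate_frequency` helpers are all hypothetical, and the paper's real model regresses on richer representation-quality features.

```python
# Hypothetical sketch: regress log co-occurrence frequency on a single
# linear-probe quality score. All data here is synthetic illustration;
# the paper's regression uses real probe measurements from trained LMs.

def fit_ols(xs, ys):
    """Closed-form simple linear regression: y ~ a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    b = my - a * mx
    return a, b

# Synthetic training pairs: (probe quality score, log10 co-occurrence count).
probe_quality = [0.2, 0.4, 0.6, 0.8, 0.95]
log_freq = [1.0, 2.0, 3.0, 4.0, 4.8]

a, b = fit_ols(probe_quality, log_freq)

def estimate_frequency(score):
    """Map a probe quality score to an estimated co-occurrence count."""
    return 10 ** (a * score + b)
```

Fitting on one model's (score, frequency) pairs and then applying `estimate_frequency` to probe scores from a second model corresponds to the cross-model transfer the paper reports.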

📝 Abstract
Pretraining data has a direct impact on the behaviors and quality of language models (LMs), but we only understand the most basic principles of this relationship. While most work focuses on pretraining data's effect on downstream task behavior, we investigate its relationship to LM representations. Previous work has discovered that, in language models, some concepts are encoded "linearly" in the representations, but what factors cause these representations to form? We study the connection between pretraining data frequency and models' linear representations of factual relations. We find evidence that the formation of linear representations is strongly connected to pretraining term frequencies; specifically for subject-relation-object fact triplets, both subject-object co-occurrence frequency and in-context learning accuracy for the relation are highly correlated with linear representations. This is the case across all phases of pretraining. In OLMo-7B and GPT-J, we discover that a linear representation consistently (but not exclusively) forms when the subjects and objects within a relation co-occur at least 1k and 2k times, respectively, regardless of when these occurrences happen during pretraining. Finally, we train a regression model on measurements of linear representation quality in fully-trained LMs that can predict how often a term was seen in pretraining. Our model achieves low error even on inputs from a different model with a different pretraining dataset, providing a new method for estimating properties of the otherwise-unknown training data of closed-data models. We conclude that the strength of linear representations in LMs contains signal about the models' pretraining corpora that may provide new avenues for controlling and improving model behavior: particularly, manipulating the models' training data to meet specific frequency thresholds.
Problem

Research questions and friction points this paper is trying to address.

Investigates link between pretraining data frequency and linear representations in LMs
Explores factors causing linear encoding of factual relations in language models
Develops method to estimate pretraining data properties from model representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Links subject-object co-occurrence frequency in pretraining data to the formation of linear representations
Identifies consistent frequency thresholds (~1k co-occurrences in OLMo-7B, ~2k in GPT-J) for linear representation formation
Predicts pretraining term frequencies from linear representation quality via a transferable regression model