Language Model Behavioral Phases are Consistent Across Architecture, Training Data, and Scale

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Understanding the universal behavioral dynamics of autoregressive language models during pretraining remains an open challenge. Method: We conduct a systematic analysis across 1,400+ checkpoints spanning 14M–12B parameters, three model architectures (Transformer, Mamba, RWKV), and two datasets (OpenWebText, The Pile). Contribution/Results: We find highly consistent behavioral phase transitions across all configurations. Crucially, up to 98% of word-level behavioral variance is explained by three simple, interpretable heuristics: unigram probability (token frequency), n-gram probability, and the semantic similarity between a word and its context. Rather than gradual generalization, the results suggest that models undergo staged overfitting to increasingly higher-order local statistical patterns (e.g., from unigrams to trigrams). Together, these findings point to a universal behavioral trajectory for language model pretraining that holds across architecture, scale, and dataset.
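
To make the variance-explained claim concrete, below is a minimal Python sketch, not the authors' code: it fits an ordinary linear regression from the three heuristics to per-word model surprisal and reports R². All variable names and the toy data are hypothetical stand-ins for real per-word measurements.

```python
# A minimal sketch, NOT the authors' code: estimate how much word-level
# variance in model behavior the three heuristics jointly explain.
# Assumed inputs, per word of evaluation text:
#   surprisal      - the language model's surprisal for the word
#   log_unigram    - log unigram probability (frequency) of the word
#   log_ngram      - log n-gram probability of the word in context
#   sem_similarity - semantic similarity between the word and its context
import numpy as np
from sklearn.linear_model import LinearRegression

def variance_explained(surprisal, log_unigram, log_ngram, sem_similarity):
    """Fit surprisal from the three heuristics and return R^2."""
    X = np.column_stack([log_unigram, log_ngram, sem_similarity])
    y = np.asarray(surprisal)
    reg = LinearRegression().fit(X, y)
    return reg.score(X, y)  # coefficient of determination (R^2)

# Hypothetical toy data standing in for real per-word measurements.
rng = np.random.default_rng(0)
n = 1000
log_unigram = rng.normal(-8.0, 2.0, n)
log_ngram = log_unigram + rng.normal(0.0, 1.0, n)
sem_similarity = rng.uniform(0.0, 1.0, n)
surprisal = (-0.6 * log_ngram - 0.3 * log_unigram
             - 2.0 * sem_similarity + rng.normal(0.0, 0.5, n))
print(f"R^2 = {variance_explained(surprisal, log_unigram, log_ngram, sem_similarity):.3f}")
```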

📝 Abstract
We show that across architecture (Transformer vs. Mamba vs. RWKV), training dataset (OpenWebText vs. The Pile), and scale (14 million parameters to 12 billion parameters), autoregressive language models exhibit highly consistent patterns of change in their behavior over the course of pretraining. Based on our analysis of over 1,400 language model checkpoints on over 110,000 tokens of English, we find that up to 98% of the variance in language model behavior at the word level can be explained by three simple heuristics: the unigram probability (frequency) of a given word, the $n$-gram probability of the word, and the semantic similarity between the word and its context. Furthermore, we see consistent behavioral phases in all language models, with their predicted probabilities for words overfitting to those words' $n$-gram probabilities for increasing $n$ over the course of training. Taken together, these results suggest that learning in neural language models may follow a similar trajectory irrespective of model details.
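
One way to operationalize the staged-overfitting observation, purely as an illustration of the idea rather than the paper's actual analysis pipeline, is to ask at each checkpoint which n-gram order best predicts the model's word probabilities. In the sketch below, the inputs, the correlation criterion, and the toy data are all assumptions:

```python
# A minimal sketch, NOT the paper's pipeline: for each checkpoint, find the
# n-gram order whose log probabilities best correlate with the model's.
# `checkpoint_logprobs` (step -> per-word model log probabilities) and
# `ngram_logprobs` (n -> per-word n-gram log probabilities) are assumed
# to be precomputed on the same evaluation text.
import numpy as np

def best_matching_order(checkpoint_logprobs, ngram_logprobs):
    """Return, per checkpoint step, the n-gram order n that correlates
    most strongly with the model's per-word log probabilities."""
    phases = {}
    for step, model_lp in checkpoint_logprobs.items():
        corrs = {n: np.corrcoef(model_lp, ngram_lp)[0, 1]
                 for n, ngram_lp in ngram_logprobs.items()}
        phases[step] = max(corrs, key=corrs.get)
    return phases

# Hypothetical toy data: an early checkpoint tracks unigram statistics,
# a later one tracks trigram statistics.
rng = np.random.default_rng(1)
uni = rng.normal(-8.0, 2.0, 500)
tri = uni + rng.normal(0.0, 2.0, 500)
ngram_logprobs = {1: uni, 3: tri}
checkpoint_logprobs = {
    1_000: uni + rng.normal(0.0, 0.3, 500),    # early training
    100_000: tri + rng.normal(0.0, 0.3, 500),  # late training
}
print(best_matching_order(checkpoint_logprobs, ngram_logprobs))  # {1000: 1, 100000: 3}
```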
Problem

Research questions and friction points this paper is trying to address.

Do autoregressive language models show consistent behavioral patterns across architectures, datasets, and scales?
How much word-level variance in model behavior can simple heuristics (word frequency, n-gram probability, semantic similarity) explain?
Do models pass through consistent phases of overfitting to n-gram probabilities over the course of training?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Consistent behavioral phases across Transformer, Mamba, and RWKV architectures, two datasets, and models from 14M to 12B parameters
Three simple heuristics explain up to 98% of word-level behavioral variance (the semantic-similarity heuristic is sketched after this list)
Models overfit to n-gram probabilities of increasing order n as training progresses
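
As a concrete reading of the third heuristic, the sketch below computes semantic similarity as the cosine between a word's embedding and the mean embedding of its context. This specific formulation is an assumption for illustration, not necessarily the paper's exact measure.

```python
# A minimal sketch of the semantic-similarity heuristic, under the
# assumption (ours, not necessarily the paper's exact formulation) that
# it is the cosine similarity between a word's embedding and the mean
# embedding of its preceding context.
import numpy as np

def semantic_similarity(word_vec, context_vecs):
    """Cosine similarity between a word vector and the mean context vector."""
    context_mean = np.mean(context_vecs, axis=0)
    return (np.dot(word_vec, context_mean)
            / (np.linalg.norm(word_vec) * np.linalg.norm(context_mean)))

# Hypothetical 4-dimensional embeddings for illustration.
word_vec = np.array([0.2, 0.9, 0.1, 0.4])
context_vecs = np.array([[0.1, 0.8, 0.0, 0.5],
                         [0.3, 0.7, 0.2, 0.3]])
print(f"similarity = {semantic_similarity(word_vec, context_vecs):.3f}")
```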
James A. Michaelov
Massachusetts Institute of Technology
Cognitive Science · Linguistics
Roger P. Levy
Department of Brain and Cognitive Sciences, MIT
Benjamin K. Bergen
Department of Cognitive Science, UCSD