🤖 AI Summary
This study investigates whether language models linearly encode the temporal order of knowledge acquisition in their hidden-layer activations. We perform stage-wise fine-tuning on Llama-3.2-1B and systematically analyze its hidden-state representations. We find that the model spontaneously organizes per-stage activation centroids into a temporally aligned linear structure within a low-dimensional subspace: the centroids lie on a straight line, ordered by training sequence. This temporal encoding is not explained by superficial signals such as activation magnitude, loss, or confidence, and it supports robust early-versus-late knowledge discrimination via a lightweight linear probe (~90% accuracy) that generalizes to unseen entities. Our key contribution is the first empirical demonstration that large language models intrinsically develop a decodable, linear time-of-acquisition representation during training, a finding that offers new insight into knowledge evolution and internal representation dynamics.
📝 Abstract
We show that language models' activations linearly encode when information was learned during training. To create a model with a known training order, we sequentially fine-tune Llama-3.2-1B on six disjoint but otherwise similar datasets about named entities. We find that the average activations of test samples from the six training datasets encode the training order: when projected into a 2D subspace, these centroids are arranged exactly in the order of training and lie on a straight line. Further, we show that linear probes can distinguish "early" from "late" entities with ~90% accuracy, generalizing to entities unseen during the probes' own training. The model can also be fine-tuned to explicitly report an unseen entity's training stage (~80% accuracy). Interestingly, this temporal signal does not appear attributable to simple differences in activation magnitude, loss, or model confidence. Our results demonstrate that models can differentiate information by its acquisition time, with significant implications for how they might handle conflicting data and respond to knowledge edits.
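The centroid and probe analysis described above can be sketched in a few lines of NumPy. This is not the paper's code: the hidden states below are synthetic stand-ins (random vectors with a small per-stage drift, a hypothetical choice) rather than real Llama-3.2-1B activations, and all sizes and the least-squares probe are illustrative assumptions. It shows the analysis shape only: compute per-stage centroids, project them into a 2D subspace via PCA, check that they fall in training order along a line, and fit a lightweight linear probe for early vs. late stages.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for real hidden states: six training stages,
# each stage's samples shifted along a shared drift direction.
# (The paper extracts these from Llama-3.2-1B; here they are synthetic.)
d, n_per_stage, n_stages = 64, 200, 6
drift = rng.normal(size=d)
drift /= np.linalg.norm(drift)

X, y = [], []
for stage in range(n_stages):
    X.append(rng.normal(size=(n_per_stage, d)) + stage * 0.8 * drift)
    y.append(np.full(n_per_stage, stage))
X, y = np.vstack(X), np.concatenate(y)

# Per-stage activation centroids.
centroids = np.stack([X[y == s].mean(axis=0) for s in range(n_stages)])

# Project centroids into a 2D subspace via PCA (SVD of centered centroids).
centered = centroids - centroids.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
proj2d = centered @ Vt[:2].T

# If training order is linearly encoded, the first principal-component
# coordinate should be monotone in stage index (sign is arbitrary).
pc1 = proj2d[:, 0]
monotone = bool(np.all(np.diff(pc1) > 0) or np.all(np.diff(pc1) < 0))

# Lightweight linear probe: "early" (stages 0-2) vs "late" (3-5),
# fit by least squares on a train split, scored on a held-out split.
labels = (y >= 3).astype(float)
idx = rng.permutation(len(X))
split = int(0.8 * len(X))
tr, te = idx[:split], idx[split:]
Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
w, *_ = np.linalg.lstsq(Xb[tr], labels[tr] * 2 - 1, rcond=None)
acc = float(np.mean((Xb[te] @ w > 0) == labels[te].astype(bool)))
print(monotone, round(acc, 2))
```

With a clean temporal drift the centroids line up along the first principal component and the probe separates early from late well above chance; on real activations the paper reports the analogous probe reaching ~90% accuracy.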