Evolution of Concepts in Language Model Pre-Training

πŸ“… 2025-09-21
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This study investigates the dynamic evolution of conceptual features during language model pre-training and its mechanistic impact on downstream performance. We propose a fine-grained analytical framework based on crosscoder sparse dictionary learning, enabling the first cross-temporal tracking and attribution of linearly interpretable features across Transformer training stages. Methodologically, we model sequences of pre-training snapshots to quantify the evolution of feature activation patterns, emergence timing, and representational complexity. Our key contributions are threefold: (1) empirical validation and refinement of the two-phase learning theory, in which early stages prioritize statistical pattern acquisition while later stages shift toward constructing higher-order semantic features; (2) the discovery that roughly 80% of critical features emerge in a concentrated window during mid-training, with their emergence timeline tightly synchronized with downstream performance gains; (3) the establishment of causal links between feature evolution trajectories and generalization capability, providing theoretical foundations and analytical tools for interpretable AI and efficient pre-training.

πŸ“ Abstract
Language models obtain extensive capabilities through pre-training. However, the pre-training process remains a black box. In this work, we track the evolution of linearly interpretable features across pre-training snapshots using a sparse dictionary learning method called crosscoders. We find that most features begin to form around a specific point in training, while more complex patterns emerge in later stages. Feature attribution analyses reveal causal connections between feature evolution and downstream performance. Our feature-level observations are highly consistent with previous findings on the Transformer's two-stage learning process, which we term a statistical learning phase and a feature learning phase. Our work opens up the possibility of tracking fine-grained representation progress during language model learning dynamics.
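The crosscoder idea described above can be sketched minimally: a single encoder produces one shared sparse feature code from the concatenated activations of the same token across several pre-training snapshots, and a separate decoder per snapshot reconstructs that snapshot's activations. This is a toy illustration with random, untrained weights and assumed dimensions, not the paper's actual architecture or training setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_feats, n_snapshots = 16, 64, 3  # assumed toy dimensions

# Shared encoder over the concatenated snapshot activations
W_enc = rng.normal(scale=0.1, size=(n_snapshots * d_model, n_feats))
b_enc = np.zeros(n_feats)
# One decoder per snapshot, all reading the same sparse code
W_dec = rng.normal(scale=0.1, size=(n_snapshots, n_feats, d_model))

def encode(acts):
    """acts: (n_snapshots, d_model) activations of one token at each snapshot."""
    x = acts.reshape(-1)                      # concatenate across snapshots
    return np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU gives a sparse code

def decode(code):
    """Reconstruct every snapshot's activation from the shared code."""
    return np.einsum("j,sjd->sd", code, W_dec)

acts = rng.normal(size=(n_snapshots, d_model))
code = encode(acts)          # shape (n_feats,), shared across snapshots
recon = decode(code)         # shape (n_snapshots, d_model)

# Per-snapshot decoder norms: a feature whose decoder norm is near zero
# at early snapshots but large later is one that emerges during training.
dec_norms = np.linalg.norm(W_dec, axis=2)  # shape (n_snapshots, n_feats)
```

Because the feature code is shared while the decoders are snapshot-specific, comparing a feature's decoder norms across snapshots gives a natural signal for when that feature appears during pre-training.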
Problem

Research questions and friction points this paper is trying to address.

Tracking interpretable feature evolution during language model pre-training
Understanding causal connections between feature development and model performance
Analyzing fine-grained representation progress across different training stages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Crosscoders track feature evolution snapshots
Feature attribution reveals causal performance connections
Identifies statistical and feature learning phases
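The emergence-timing analysis mentioned above can be illustrated with a toy computation: given each feature's activation frequency at every snapshot (the numbers and feature names below are hypothetical), record the first snapshot at which the frequency crosses a threshold. This is only a schematic of the idea, not the paper's actual criterion:

```python
# Hypothetical activation frequencies: feature id -> frequency per snapshot.
act_freq = {
    "feat_0": [0.00, 0.01, 0.12, 0.30, 0.31],
    "feat_1": [0.00, 0.00, 0.02, 0.25, 0.40],
    "feat_2": [0.05, 0.06, 0.07, 0.08, 0.09],  # never clearly emerges
}

def emergence_snapshot(freqs, threshold=0.1):
    """Index of the first snapshot where frequency >= threshold, else None."""
    for i, f in enumerate(freqs):
        if f >= threshold:
            return i
    return None

timing = {name: emergence_snapshot(f) for name, f in act_freq.items()}
print(timing)  # {'feat_0': 2, 'feat_1': 3, 'feat_2': None}
```

Aggregating such per-feature emergence indices over the full dictionary is what lets one say that most features appear in a concentrated mid-training window and compare that window with downstream performance curves.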
Xuyang Ge
OpenMOSS Team, Shanghai Innovation Institute; Fudan University
Wentao Shu
OpenMOSS Team, Shanghai Innovation Institute; Fudan University
Jiaxing Wu
OpenMOSS Team, Shanghai Innovation Institute; Fudan University
Yunhua Zhou
Fudan University
Machine Learning · Natural Language Processing
Zhengfu He
Shanghai Innovation Institute
Mechanistic Interpretability · Large Language Models
Xipeng Qiu
OpenMOSS Team, Shanghai Innovation Institute; Fudan University