Evolution of Concepts in Language Model Pre-Training

πŸ“… 2025-09-21
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This study investigates the dynamic evolution of conceptual features during language model pre-training and its mechanistic impact on downstream performance. We propose a fine-grained analytical framework based on crosscoder sparse dictionary learning, enabling the first cross-temporal tracking and attribution of linearly interpretable features across Transformer training stages. Methodologically, we model sequences of pre-training snapshots to quantify the evolution of feature activation patterns, emergence timing, and representational complexity. Our key contributions are threefold: (1) empirical validation and refinement of the two-phase learning theory, in which early stages prioritize statistical pattern acquisition while later stages shift toward constructing higher-order semantic features; (2) the discovery that roughly 80% of critical features emerge in a concentrated window during mid-training, with their emergence timeline tightly synchronized with downstream performance gains; (3) the establishment of causal links between feature evolution trajectories and generalization capability, providing theoretical foundations and analytical tools for interpretable AI and efficient pre-training.

πŸ“ Abstract
Language models obtain extensive capabilities through pre-training. However, the pre-training process remains a black box. In this work, we track the evolution of linearly interpretable features across pre-training snapshots using a sparse dictionary learning method called crosscoders. We find that most features begin to form around a specific point in training, while more complex patterns emerge in later stages. Feature attribution analyses reveal causal connections between feature evolution and downstream performance. Our feature-level observations are highly consistent with previous findings on the Transformer's two-stage learning process, which we term a statistical learning phase and a feature learning phase. Our work opens up the possibility of tracking fine-grained representation progress during language model learning dynamics.
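The crosscoder idea described above can be sketched minimally: a single encoder produces one shared sparse feature code from the concatenated activations of the same token across several pre-training snapshots, and a separate decoder per snapshot reconstructs that snapshot's activations. This is a toy illustration with random, untrained weights and assumed dimensions, not the paper's actual architecture or training setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_feats, n_snapshots = 16, 64, 3  # assumed toy dimensions

# Shared encoder over the concatenated snapshot activations
W_enc = rng.normal(scale=0.1, size=(n_snapshots * d_model, n_feats))
b_enc = np.zeros(n_feats)
# One decoder per snapshot, all reading the same sparse code
W_dec = rng.normal(scale=0.1, size=(n_snapshots, n_feats, d_model))

def encode(acts):
    """acts: (n_snapshots, d_model) activations of one token at each snapshot."""
    x = acts.reshape(-1)                      # concatenate across snapshots
    return np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU gives a sparse code

def decode(code):
    """Reconstruct every snapshot's activation from the shared code."""
    return np.einsum("j,sjd->sd", code, W_dec)

acts = rng.normal(size=(n_snapshots, d_model))
code = encode(acts)          # shape (n_feats,), shared across snapshots
recon = decode(code)         # shape (n_snapshots, d_model)

# Per-snapshot decoder norms: a feature whose decoder norm is near zero
# at early snapshots but large later is one that emerges during training.
dec_norms = np.linalg.norm(W_dec, axis=2)  # shape (n_snapshots, n_feats)
```

Because the feature code is shared while the decoders are snapshot-specific, comparing a feature's decoder norms across snapshots gives a natural signal for when that feature appears during pre-training.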
Problem

Research questions and friction points this paper is trying to address.

Tracking interpretable feature evolution during language model pre-training
Understanding causal connections between feature development and model performance
Analyzing fine-grained representation progress across different training stages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Crosscoders track feature evolution snapshots
Feature attribution reveals causal performance connections
Identifies statistical and feature learning phases
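The emergence-timing analysis mentioned above can be illustrated with a toy computation: given each feature's activation frequency at every snapshot (the numbers and feature names below are hypothetical), record the first snapshot at which the frequency crosses a threshold. This is only a schematic of the idea, not the paper's actual criterion:

```python
# Hypothetical activation frequencies: feature id -> frequency per snapshot.
act_freq = {
    "feat_0": [0.00, 0.01, 0.12, 0.30, 0.31],
    "feat_1": [0.00, 0.00, 0.02, 0.25, 0.40],
    "feat_2": [0.05, 0.06, 0.07, 0.08, 0.09],  # never clearly emerges
}

def emergence_snapshot(freqs, threshold=0.1):
    """Index of the first snapshot where frequency >= threshold, else None."""
    for i, f in enumerate(freqs):
        if f >= threshold:
            return i
    return None

timing = {name: emergence_snapshot(f) for name, f in act_freq.items()}
print(timing)  # {'feat_0': 2, 'feat_1': 3, 'feat_2': None}
```

Aggregating such per-feature emergence indices over the full dictionary is what lets one say that most features appear in a concentrated mid-training window and compare that window with downstream performance curves.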
Xuyang Ge
OpenMOSS Team, Shanghai Innovation Institute; Fudan University
Wentao Shu
OpenMOSS Team, Shanghai Innovation Institute; Fudan University
Jiaxing Wu
OpenMOSS Team, Shanghai Innovation Institute; Fudan University
Yunhua Zhou
Fudan University
Machine Learning · Natural Language Processing
Zhengfu He
Shanghai Innovation Institute
Mechanistic Interpretability · Large Language Models
Xipeng Qiu
OpenMOSS Team, Shanghai Innovation Institute; Fudan University