Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

📅 2026-01-29
📈 Citations: 1
Influential: 0
🤖 AI Summary
The causal origins of interpretable units—such as induction heads—in large language models remain poorly understood. This work proposes a scalable mechanistic data attribution framework that integrates influence functions with causal interventions to establish, for the first time, direct causal links between specific training examples and the emergence of such interpretable components. The study reveals that structured repetitive data plays a catalytic role in circuit formation and demonstrates a direct functional relationship between induction heads and in-context learning capabilities. Selectively intervening on a small set of high-influence training samples significantly modulates the emergence of the corresponding interpretable attention heads. Furthermore, the proposed data augmentation strategy consistently accelerates circuit convergence across different model scales.
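
The summary does not specify how induction-head emergence is measured. A common diagnostic is the prefix-matching score on a sequence of random tokens repeated twice; the sketch below illustrates that metric only, not the authors' exact protocol. The tensor name `pattern` and the two-repeat setup are assumptions for illustration.

```python
# Minimal sketch (not the paper's code): score how strongly each attention head
# behaves like an induction head on a repeated random-token sequence.
# Assumes `pattern` is the post-softmax attention of one layer with shape
# [n_heads, seq_len, seq_len], obtained however your framework exposes it.
import torch

def induction_scores(pattern: torch.Tensor, period: int) -> torch.Tensor:
    """Average attention from each token in the second repeat back to the
    position right after its match in the first repeat (offset period - 1)."""
    n_heads, seq_len, _ = pattern.shape
    offset = period - 1
    queries = torch.arange(period, seq_len)   # positions in the second repeat
    keys = queries - offset                    # token following the earlier match
    # pattern[h, q, k] is the attention paid by query q to key k for head h.
    return pattern[:, queries, keys].mean(dim=-1)  # one score per head
```

Usage would be: build a sequence of `period` random tokens repeated twice, run the model while caching attention, then rank heads per layer by `induction_scores(...)` over training checkpoints to track emergence.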

📝 Abstract
While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention (removing or augmenting a small fraction of high-influence samples) significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model's in-context learning (ICL) capability. This provides direct causal evidence for the long-standing hypothesis regarding the functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that consistently accelerates circuit convergence across model scales, providing a principled methodology for steering the developmental trajectories of LLMs.
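
The abstract describes tracing interpretable units to training samples with influence functions. The sketch below shows the general shape of such an estimator under a deliberately crude simplification: the inverse Hessian is replaced by the identity, reducing the influence to a gradient dot product. The helpers `induction_score(model)` and `loss_fn(model, example)` are hypothetical placeholders for the measurement and per-example training loss, not the paper's API.

```python
# Minimal sketch, assuming a PyTorch model and hypothetical helpers; not MDA itself.
import torch

def flat_grad(scalar: torch.Tensor, params) -> torch.Tensor:
    """Gradient of a scalar w.r.t. params, flattened into one vector."""
    grads = torch.autograd.grad(scalar, params, allow_unused=True)
    return torch.cat([(g if g is not None else torch.zeros_like(p)).reshape(-1)
                      for g, p in zip(grads, params)])

def influence_on_measurement(model, train_examples, induction_score, loss_fn):
    """Approximate how much each training example pushes the induction-head
    score up or down: -grad_m(theta)^T H^{-1} grad_L(z, theta), with H ~ I."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_measure = flat_grad(induction_score(model), params)   # d(score)/d(theta)
    scores = []
    for example in train_examples:
        g_train = flat_grad(loss_fn(model, example), params)  # d(loss_z)/d(theta)
        scores.append(-torch.dot(g_measure, g_train).item())
    return scores  # high-magnitude examples are candidates for removal/augmentation
```

In practice an inverse-Hessian-vector product (e.g., a stochastic LiSSA-style approximation) would replace the identity assumption; the identity version is shown only to keep the sketch self-contained.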
Problem

Research questions and friction points this paper is trying to address.

Mechanistic Interpretability
Data Attribution
Training Data Origins
Induction Heads
In-Context Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mechanistic Interpretability
Influence Functions
Induction Heads
In-Context Learning
Data Attribution
Jianhui Chen
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Yuzhang Luo
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Liangming Pan
Assistant Professor, School of Computer Science, Peking University
Natural Language Processing, Large Language Models, Machine Learning