Uncovering Intermediate Variables in Transformers using Circuit Probing

📅 2023-11-07
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
🤖 AI Summary
Understanding the causal roles and computational mechanisms of intermediate variables, such as syntactic attributes, in Transformer language models remains challenging. Method: We propose *circuit probing*, a hypothesis-driven methodology that reverse-engineers intermediate representations encoding specific linguistic properties and identifies the parameter-level neural circuits that compute them. The approach combines gradient-guided circuit discovery, targeted parameter ablation, verification via diagnostic probing, and modular attribution analysis, enabling causal intervention and algorithm-level interpretation. Contribution/Results: The method unifies hypothesis testing, circuit localization, and developmental tracing in a single framework, revealing the implicit algorithmic structure of models and how it emerges over training. Experiments decode symbolic arithmetic logic in models trained on simple arithmetic tasks, localize subject-verb agreement and reflexive-anaphora circuits in GPT2-Small and Medium, and confirm the progressive emergence of these circuits during training.
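The mask-learning idea behind circuit probing can be illustrated with a minimal sketch. This is not the paper's implementation: the toy two-unit linear "model", the sum-readout, the L1 weight, and the binarization threshold are all illustrative assumptions. A sparse mask over the weights is trained so that the masked subnetwork computes a hypothesized intermediate variable (here, a + b); the surviving weights are the "circuit", which can then be ablated for causal analysis.

```python
import numpy as np

# Toy "model": hidden = W @ x for inputs x = [a, b].
# Row 0 mixes a and b; row 1 computes a - b.
W = np.array([[2.0, 1.0],
              [1.0, -1.0]])

# Hypothesized intermediate variable v = a + b, i.e. target readout t = [1, 1].
t = np.array([1.0, 1.0])

def circuit_probe(W, t, lam=0.05, lr=0.05, steps=3000):
    """Learn a sparse mask M in [0, 1] over W so that summing the masked
    hidden units recovers t @ x.  For x ~ N(0, I) the expected squared
    error has the closed form ||c - t||^2 with c_j = sum_i M_ij * W_ij,
    so we can descend on that directly instead of sampling data."""
    M = np.full_like(W, 0.5)
    for _ in range(steps):
        c = (M * W).sum(axis=0)              # effective readout per input dim
        grad = 2.0 * (c - t)[None, :] * W    # gradient of ||c - t||^2 w.r.t. M
        grad += lam                          # L1 penalty (M is kept >= 0)
        M = np.clip(M - lr * grad, 0.0, 1.0)
    return M

M = circuit_probe(W, t)
circuit = M > 0.1                 # binarize: which weights form the circuit
                                  # (selects row 0, which carries a + b)

# Targeted ablation: zero the circuit's weights; row 1 (a - b) is untouched,
# so the hypothesized variable is removed while other computation survives.
W_ablated = np.where(circuit, 0.0, W)
```

The closed-form loss is a shortcut specific to this linear toy; on a real Transformer the mask would be trained on sampled activations, as the gradient-guided discovery step describes.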
📝 Abstract
Neural network models have achieved high performance on a wide variety of complex tasks, but the algorithms that they implement are notoriously difficult to interpret. It is often necessary to hypothesize intermediate variables involved in a network's computation in order to understand these algorithms. For example, does a language model depend on particular syntactic properties when generating a sentence? Yet, existing analysis tools make it difficult to test hypotheses of this type. We propose a new analysis technique - circuit probing - that automatically uncovers low-level circuits that compute hypothesized intermediate variables. This enables causal analysis through targeted ablation at the level of model parameters. We apply this method to models trained on simple arithmetic tasks, demonstrating its effectiveness at (1) deciphering the algorithms that models have learned, (2) revealing modular structure within a model, and (3) tracking the development of circuits over training. Across these three experiments we demonstrate that circuit probing combines and extends the capabilities of existing methods, providing one unified approach for a variety of analyses. Finally, we demonstrate circuit probing on a real-world use case: uncovering circuits that are responsible for subject-verb agreement and reflexive anaphora in GPT2-Small and Medium.
Problem

Research questions and friction points this paper is trying to address.

Interpreting the algorithms that neural networks implement.
Identifying intermediate variables in a network's computation.
Testing hypotheses about reliance on syntactic properties.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Circuit probing automatically uncovers circuits that compute hypothesized intermediate variables.
Targeted parameter ablation enables causal analysis.
The method reveals modular structure within models.
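The diagnostic-probe verification step can be sketched as follows. This is a hypothetical toy setup, not the paper's code: a linear probe is fit by least squares to test whether the hypothesized variable (a + b) is decodable from a model's hidden states, before and after ablating a candidate circuit (here, assumed to be the first hidden unit).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hidden states: hidden = W @ x for inputs x = [a, b].
W = np.array([[2.0, 1.0],
              [1.0, -1.0]])
X = rng.normal(size=(500, 2))
H = X @ W.T                    # (500, 2) hidden states
v = X.sum(axis=1)              # hypothesized variable: a + b

def probe_r2(H, v):
    """Fit a least-squares linear probe v ~ H @ w and return its R^2."""
    w, *_ = np.linalg.lstsq(H, v, rcond=None)
    resid = v - H @ w
    return 1.0 - resid.var() / v.var()

r2_full = probe_r2(H, v)       # variable is decodable from the intact model

# Ablate the candidate circuit (zero hidden unit 0) and re-probe: the
# remaining unit carries a - b, which is uncorrelated with a + b here,
# so decodability should collapse.
H_ablated = X @ (W * np.array([[0.0], [1.0]])).T
r2_ablated = probe_r2(H_ablated, v)
```

A large drop in probe R^2 after ablation is the causal evidence that the ablated weights, and not some redundant pathway, computed the variable.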