🤖 AI Summary
This work investigates the **causal role**—not merely correlation—of Transformer attention heads in task performance. To this end, we propose **Causal Head Gating (CHG)**, a fully data-driven, soft-gating framework that performs scalable causal attribution per head (facilitating, interfering, or irrelevant) without requiring predefined hypotheses, prompt templates, or human annotations. Methodologically, CHG integrates soft-gating optimization, causal mediation analysis, and contrastive ablation, enabling automated classification of attention heads by causal function and identification of task-specific sparse sub-circuits. Experiments on the Llama-3 family demonstrate that CHG reliably identifies causally critical heads; reveals distinct mechanisms for instruction following versus in-context learning; and discovers sub-circuits that generalize across syntactic, commonsense, and mathematical reasoning tasks. Moreover, head-level dependencies exhibit low modularity, suggesting distributed, non-local functional organization.
📝 Abstract
We present causal head gating (CHG), a scalable method for interpreting the functional roles of attention heads in transformer models. CHG learns soft gates over heads and assigns them a causal taxonomy (facilitating, interfering, or irrelevant) based on their impact on task performance. Unlike prior approaches in mechanistic interpretability, which are hypothesis-driven and require prompt templates or target labels, CHG applies directly to any dataset using standard next-token prediction. We evaluate CHG across multiple large language models (LLMs) in the Llama 3 model family and diverse tasks, including syntax, commonsense, and mathematical reasoning, and show that CHG scores yield causal, not merely correlational, insight, validated via ablation and causal mediation analyses. We also introduce contrastive CHG, a variant that isolates sub-circuits for specific task components. Our findings reveal that LLMs contain multiple sparse, sufficient sub-circuits, that individual head roles depend on interactions with others (low modularity), and that instruction following and in-context learning rely on separable mechanisms.
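The soft-gating idea can be sketched on a toy stand-in: each head's contribution is scaled by a learnable sigmoid gate, gates are fit against the task loss, and a sparsity pressure applied in either direction separates facilitating heads (which stay open even when gates are pushed closed) from interfering heads (which close even when gates are pushed open). Everything below, including the synthetic head contributions, the regularizer, and the hyperparameters, is an illustrative assumption, not the paper's implementation, which operates on real attention heads with a next-token prediction loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for per-head contributions to a model's output.
# Head 0 facilitates the task, head 1 interferes, head 2 is irrelevant.
T = 200
target = rng.normal(size=T)                      # desired output per example
heads = np.stack([
    target + 0.1 * rng.normal(size=T),           # facilitating
    -0.5 * target + 0.1 * rng.normal(size=T),    # interfering
    0.1 * rng.normal(size=T),                    # irrelevant
])                                               # shape (H, T)

def task_loss(gates):
    """Loss when each head's contribution is scaled by its soft gate."""
    out = gates @ heads
    return np.mean((out - target) ** 2)

def fit_gates(reg_sign, steps=1000, lr=0.3, lam=0.05):
    """Learn sigmoid gates by gradient descent on the task loss plus a
    sparsity term; reg_sign=+1 pushes gates toward 0 (keep only heads
    the task needs), reg_sign=-1 pushes gates toward 1 (close only
    heads that must be silenced)."""
    logits = np.zeros(heads.shape[0])
    for _ in range(steps):
        g = 1.0 / (1.0 + np.exp(-logits))
        grad = np.zeros_like(logits)
        for h in range(len(logits)):             # finite-difference gradient
            eps = 1e-4
            lp, lm = logits.copy(), logits.copy()
            lp[h] += eps
            lm[h] -= eps
            gp = 1.0 / (1.0 + np.exp(-lp))
            gm = 1.0 / (1.0 + np.exp(-lm))
            grad[h] = (task_loss(gp) - task_loss(gm)) / (2 * eps)
        grad += reg_sign * lam * g * (1.0 - g)   # d(lam * sum(g)) / d logits
        logits -= lr * grad
    return 1.0 / (1.0 + np.exp(-logits))

g_down = fit_gates(+1)   # sparsity toward 0: facilitating heads survive
g_up = fit_gates(-1)     # sparsity toward 1: interfering heads still close
print("gates pushed closed:", np.round(g_down, 2))
print("gates pushed open:  ", np.round(g_up, 2))
```

A head that stays near 1 under both pressures reads as facilitating, one that lands near 0 under both as interfering, and one whose gate simply follows the regularizer as irrelevant, mirroring the three-way taxonomy described above.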