How Much Do Circuits Tell Us? Measuring the Consistency and Specificity of Language Model Circuits

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates whether circuits in language models are consistent within a task and specific to that task, and what these properties imply for understanding and intervening on model behavior. Using edge attribution patching and component ablation, the authors evaluate causally important subgraphs of attention heads and MLP layers across six tasks and seven models. Within a task, components are heavily reused across per-example circuits and are necessary for performance. Across tasks, however, circuits overlap substantially, and the components exclusive to a single task account for only a modest share of circuit performance. These results call into question how far circuits at this granularity can support targeted interpretation and intervention.

📝 Abstract

The circuits framework in mechanistic interpretability aims to identify causally important sparse subgraphs of model components, typically evaluated by measuring necessity and sufficiency. We measure circuit reuse, the proportion of components shared across per-example circuits within a task, and use it to investigate two less-studied properties of circuits: consistency, the recurrence of components within a task, and specificity, their uniqueness to a task. Using edge attribution patching across six tasks and seven models, we find that within-task reuse is high and that shared components are necessary for task performance, with ablations causing up to $\sim$100% relative accuracy drops. However, circuits turn out not to be task-specific: ablating one task's circuit damages another task's performance about as much as ablating that task's own circuit does. We find that this is because circuits overlap substantially across tasks, and the overlapping components are causally important for performance. Some circuits do contain a smaller set of task-specific components, but these account for only a modest portion of circuit performance. Overall, our findings suggest that while circuit discovery at the level of attention heads and MLP layers identifies important components, their lack of task-specificity raises questions about the degree to which circuits can support targeted understanding and intervention on model behavior.
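The reuse and overlap quantities described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact formula, so the code assumes that a component counts as "shared" when it appears in at least a `threshold` fraction of a task's per-example circuits, and that reuse is the number of shared components divided by the number of distinct components seen in any of those circuits. The component names, the `threshold` parameter, and the use of Jaccard similarity for cross-task overlap are all hypothetical choices made for illustration.

```python
from collections import Counter


def shared_components(circuits, threshold=1.0):
    """Components present in at least `threshold` of the per-example circuits.

    Assumption: this is one plausible reading of "shared"; the paper may
    define it differently.
    """
    n = len(circuits)
    counts = Counter(c for circuit in circuits for c in set(circuit))
    return {c for c, k in counts.items() if k / n >= threshold}


def reuse(circuits, threshold=1.0):
    """Shared components as a fraction of all components seen in the task."""
    union = set().union(*circuits)
    if not union:
        return 0.0
    return len(shared_components(circuits, threshold)) / len(union)


def jaccard(a, b):
    """Overlap between two component sets (illustrative choice of metric)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical per-example circuits, each a set of component names.
task_a = [
    {"a0.h1", "mlp3", "a2.h0"},
    {"a0.h1", "mlp3", "a5.h2"},
    {"a0.h1", "mlp3"},
]
task_b = [
    {"a0.h1", "mlp3", "a7.h4"},
    {"a0.h1", "mlp3", "a7.h4"},
]

core_a = shared_components(task_a)  # consistency: what recurs within task A
core_b = shared_components(task_b)
overlap = jaccard(core_a, core_b)   # low specificity shows up as high overlap
only_b = core_b - core_a            # components exclusive to task B
```

In this toy data, the two tasks share most of their core components and only one component is exclusive to task B, which mirrors the qualitative pattern the abstract reports. Measuring causal importance would additionally require ablating these sets in a real model and comparing accuracy, which this sketch does not attempt.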

Problem

Research questions and friction points this paper is trying to address.

circuits

consistency

specificity

mechanistic interpretability

language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

circuit reuse

consistency

specificity