Finding Highly Interpretable Prompt-Specific Circuits in Language Models

📅 2026-02-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work challenges the prevailing assumption in interpretability research that a single task corresponds to a single internal circuit, showing that different prompts can activate distinct mechanisms. To address this, we propose ACC++, an enhancement of attention causal communication (ACC) that efficiently identifies prompt-specific circuits by extracting low-dimensional, interpretable causal signals in a single forward pass. We demonstrate for the first time that language models do not employ a unified circuit for a fixed task—such as indirect object identification (IOI)—but instead use multiple circuits that cluster into “prompt families,” each sharing a similar mechanistic structure. Building on this insight, we develop an automated interpretability pipeline operating at the prompt-family level, which generates accurate, human-understandable mechanistic explanations on GPT-2, Pythia, and Gemma 2 without requiring replacement models or activation patching.

📝 Abstract
Understanding the internal circuits that language models use to solve tasks remains a central challenge in mechanistic interpretability. Most prior work identifies circuits at the task level by averaging across many prompts, implicitly assuming a single stable mechanism per task. We show that this assumption can obscure a crucial source of structure: circuits are prompt-specific, even within a fixed task. Building on attention causal communication (ACC) (Franco & Crovella, 2025), we introduce ACC++, a set of refinements that extract cleaner, lower-dimensional causal signals inside attention heads from a single forward pass. Like ACC, our approach does not require replacement models (e.g., SAEs) or activation patching; ACC++ further improves circuit precision by reducing attribution noise. Applying ACC++ to indirect object identification (IOI) in GPT-2, Pythia, and Gemma 2, we find there is no single circuit for IOI in any model: different prompt templates induce systematically different mechanisms. Despite this variation, prompts cluster into prompt families with similar circuits, and we propose a representative circuit for each family as a practical unit of analysis. Finally, we develop an automated interpretability pipeline that uses ACC++ signals to surface human-interpretable features and assemble mechanistic explanations of prompt-family behavior. Together, our results recast circuits as a meaningful object of study by shifting the unit of analysis from tasks to prompts, enabling scalable circuit descriptions in the presence of prompt-specific mechanisms.
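The abstract's notion of grouping prompts into families with similar circuits and picking a representative circuit per family can be illustrated with a minimal sketch. This is not the paper's algorithm: the representation (a circuit as a set of `(layer, head)` pairs), the Jaccard similarity measure, the greedy single-link grouping, and the medoid choice are all illustrative assumptions, and the head indices in the toy data are made up.

```python
# Hypothetical sketch: grouping prompt-specific circuits into "prompt families"
# and choosing a representative circuit per family. All names, thresholds, and
# head indices below are illustrative assumptions, not taken from the paper.

def jaccard(a, b):
    """Similarity between two circuits, each a set of (layer, head) pairs."""
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster_circuits(circuits, threshold=0.5):
    """Greedy single-link grouping: a prompt joins the first family containing
    some member whose circuit overlaps its own by at least `threshold`."""
    families = []
    for prompt in circuits:
        for fam in families:
            if any(jaccard(circuits[prompt], circuits[p]) >= threshold for p in fam):
                fam.append(prompt)
                break
        else:
            families.append([prompt])
    return families

def representative(family, circuits):
    """Medoid: the member whose circuit is most similar, on average,
    to the circuits of the other prompts in its family."""
    return max(family,
               key=lambda p: sum(jaccard(circuits[p], circuits[q]) for q in family))

# Toy example: two prompt templates sharing most attention heads, one distinct.
circuits = {
    "templateA_1": {(9, 6), (9, 9), (10, 0)},
    "templateA_2": {(9, 6), (9, 9), (10, 7)},
    "templateB_1": {(5, 5), (6, 9), (8, 10)},
}
families = cluster_circuits(circuits)
# families -> [["templateA_1", "templateA_2"], ["templateB_1"]]
```

The design choice worth noting is the shift in the unit of analysis the abstract describes: rather than averaging attributions across all prompts for a task, each prompt keeps its own circuit, and only structurally similar ones are merged into a family.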
Problem

Research questions and friction points this paper is trying to address.

mechanistic interpretability
prompt-specific circuits
language models
circuit analysis
indirect object identification
Innovation

Methods, ideas, or system contributions that make the work stand out.

prompt-specific circuits
ACC++
mechanistic interpretability
attention causal communication
automated interpretability pipeline
Gabriel Franco
Department of Computer Science, Boston University, Boston, USA
Lucas M. Tassis
Department of Computer Science, Boston University, Boston, USA
Azalea Rohr
Faculty of Computing & Data Sciences, Boston University, Boston, USA
Mark Crovella
Professor of Computer Science and Computing & Data Sciences, Boston University
Networking · Systems · Data Mining · Statistics · Performance Evaluation