🤖 AI Summary
This work challenges the prevailing assumption in interpretability research that a single task corresponds to a single internal circuit, highlighting that varying prompts can activate distinct mechanisms. To address this, we propose ACC++, an enhanced attention causal communication (ACC) method that efficiently identifies prompt-specific circuits by extracting low-dimensional, interpretable causal signals in a single forward pass. We demonstrate for the first time that language models do not employ a unified circuit for a fixed task—such as indirect object identification (IOI)—but instead use multiple circuits that cluster into “prompt families,” each sharing a similar mechanistic structure. Building on this insight, we develop an automated interpretability pipeline operating at the prompt-family level, which generates accurate, human-understandable mechanistic explanations on GPT-2, Pythia, and Gemma 2 without requiring replacement models or activation patching.
📝 Abstract
Understanding the internal circuits that language models use to solve tasks remains a central challenge in mechanistic interpretability. Most prior work identifies circuits at the task level by averaging across many prompts, implicitly assuming a single stable mechanism per task. We show that this assumption can obscure a crucial source of structure: circuits are prompt-specific, even within a fixed task. Building on attention causal communication (ACC) (Franco & Crovella, 2025), we introduce ACC++, a set of refinements that extracts cleaner, lower-dimensional causal signals inside attention heads from a single forward pass. Like ACC, our approach does not require replacement models (e.g., SAEs) or activation patching; ACC++ further improves circuit precision by reducing attribution noise. Applying ACC++ to indirect object identification (IOI) in GPT-2, Pythia, and Gemma 2, we find there is no single circuit for IOI in any model: different prompt templates induce systematically different mechanisms. Despite this variation, prompts cluster into prompt families with similar circuits, and we propose a representative circuit for each family as a practical unit of analysis. Finally, we develop an automated interpretability pipeline that uses ACC++ signals to surface human-interpretable features and assemble mechanistic explanations of prompt-family behavior. Together, our results shift the unit of analysis from tasks to prompts, recasting circuits as a meaningful object of study and enabling scalable circuit descriptions in the presence of prompt-specific mechanisms.
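The notion of a "prompt family" — prompts whose per-prompt circuits look alike — can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the paper does not specify its clustering procedure, and the signature vectors, the greedy threshold scheme, and all function names below are hypothetical stand-ins for whatever ACC++ actually extracts.

```python
# Hypothetical sketch: grouping prompts into "prompt families" by the cosine
# similarity of per-prompt circuit signatures (e.g., a vector of per-head
# attribution scores). The greedy threshold clustering is illustrative only,
# not the paper's actual pipeline.
from math import sqrt

def cosine(u, v):
    # Plain cosine similarity between two equal-length score vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster_prompt_families(signatures, threshold=0.9):
    """Greedily assign each prompt's signature to the first family whose
    representative (its first member) is within `threshold` cosine
    similarity; otherwise start a new family."""
    families = []  # each family is a list of prompt indices
    for i, sig in enumerate(signatures):
        for fam in families:
            if cosine(sig, signatures[fam[0]]) >= threshold:
                fam.append(i)
                break
        else:
            families.append([i])
    return families

# Toy example: prompts 0 and 1 share one mechanism signature, prompt 2 differs.
sigs = [
    [1.0, 0.0, 0.9, 0.1],
    [0.95, 0.05, 0.85, 0.1],
    [0.0, 1.0, 0.1, 0.9],
]
print(cluster_prompt_families(sigs))  # → [[0, 1], [2]]
```

The representative circuit the abstract proposes per family would then be derived from each cluster (e.g., a medoid or average signature), giving a compact unit of analysis instead of one circuit per prompt.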