🤖 AI Summary
Current interpretability methods for language models rely on manual interpretation of feature functions and lack an automated mechanism for semantic description. This work proposes the first end-to-end automated pipeline that quantifies the functional role of each feature through attribution profiles, identifies feature clusters with a novel clustering algorithm, and uses large language models to generate, simulate, and score natural-language explanations, establishing a closed-loop framework for explanation validation. The method achieves, for the first time, fully automatic semantic annotation of attribution graphs: it reproduces human-interpretable circuits on standard circuit-tracing benchmarks and identifies manipulable feature clusters in Llama-3.1-8B-Instruct responsible for generating harmful jailbreak suggestions.
📝 Abstract
In language model interpretability research, **circuit tracing** aims to identify which internal features causally contributed to a particular output and how they affected each other, with the goal of explaining the computations underlying a given behaviour. However, prior circuit-tracing work has relied on ad-hoc human interpretation of the role each feature in the circuit plays, via manual inspection of data artifacts such as the dataset examples the component activates on. We introduce **ADAG**, a fully automated end-to-end pipeline for describing these attribution graphs. To achieve this, we introduce *attribution profiles*, which quantify the functional role of a feature via its input and output gradient effects. We then introduce a novel clustering algorithm for grouping features, and an LLM explainer–simulator setup that generates and scores natural-language explanations of the functional role of these feature groups. Running our system on circuit-tracing tasks with known human analyses, we recover interpretable circuits, and we further show that ADAG can find steerable clusters responsible for a harmful-advice jailbreak in Llama 3.1 8B Instruct.